Implement Unicode grapheme clusters #11472

straight-shoota · 2021-11-17T20:27:53Z

This patch adds a representation and query methods for Unicode grapheme clusters as defined in Unicode Standard Annex #29.

The data type String::Grapheme represent a grapheme cluster. Instances can be aquired through String#graphemes and String#each_grapheme. This API is purely informational and allows no modifications.

This PR builds on top #10721 by @naqvis which is based on the shard https://github.com/naqvis/uni_text_seg. The enhancements trim down a more concise API and optimize some algorithms with inspiration from https://github.com/JuliaStrings/utf8proc.

/ref https://forum.crystal-lang.org/t/unicode-text-segmentation-grapheme-clusters/3274
Closes #10721

…ter boundaries of unicode text

Blacksmoke16 · 2021-11-17T20:35:51Z

scripts/generate_grapheme_properties.cr

@@ -0,0 +1,81 @@
+# This script generates the file src/string/grapheme/properties.cr


Would it make sense to merge this script with the other generate_unicode_data script so that there's only 1 thing that generates all unicode data?

I think it makes sense to separate these concerns into individual files. They work with different data. The code is more readable when it's clearly separated.

Instead, I would consider adding a Makefile for generating data. That allows us to easily re-generate data with a single command and manage configuration such as Unicode versions in a single place.

👍 Works for me. We also have that Unicode::VERSION constant we could use at least versus a make env var.

asterite · 2021-11-17T20:54:01Z

Wow, I was just going to ask today or tomorrow if graphemes were going to be considered for the next release, because I think they are very important and useful. This is great!

And thank you so much @naqvis for the original code! 🙏

src/string/grapheme.cr

straight-shoota · 2021-11-23T21:20:45Z

It has been some time since I did most of the work in this PR, just finished it up recently. It mostly just implements the suggestions made in its predecessor.

But since then I have thought about the internal representation again.

The Char | String union type is a huge improvement for grapheme clusters consisting of a single character. That should be most common in typical use cases, so it saves a big deal of allocations compared to String (or a collection of Char as in the initial propsal).

However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes.

I'm going to briefly sketch some ideas for performance improvements. Obviously, we can defer such optimizations to a later point when we might already have some real-world applications for reference. However it's helpful to have some ideas in mind while defining the API because that might influence what options we can easily apply later.

We can expose the method String#each_grapheme_boundary as part of the public API. That's trivial and basically cost-free. However, I'd consider it more a low-level API. It can be useful for custom grapheme-based implementations.
Employing a StringPool to avoid repeated allocations of identical grapheme clusters. There is no way to release values from a pool, so it would be a bad idea to use a global pool. But individual pools could be useful for single iterations. This could optionally be configurable and the API could allow specifying a custom pool to be used. This would be pretty easy to implement later. We just need to add StringPool parameters to Grapheme.new and the iteration methods.
A pool means we still need to allocate strings for every grapheme, but we can safe on duplicate allocations of the same graphemes. It is likely that the same graphemes show up repeatedly in longer texts of the same context, so that should be expected to be benefitial.
We could avoid allocating a string entirely by keeping a reference of the original string and making Grapheme essentially a pointer to a substring. String allocation would only happen in Grapheme#to_s. The big downside of this approach is that it keeps the original string in memory. That's a serious implication when working with large amounts of texts. If the Grapheme instaces are only used for iteration, this is not a problem. But if you start storing Graphemes somewhere, they keep pointers to the original strings alive. An example would be collecting frequencies of graphemes over a collection of texts. Assuming every evaluated texts contains a multi-character grapheme that was not encountered previously, all these strings will be kept in memory. The user would need to be aware of this behaviour and take appropriate action such as converting grapheme instances to self-owned cluster strings.

asterite · 2021-11-24T11:44:41Z

Is there a common maximum grapheme cluster length that we could take advantage of? For example if the "שָׁ" in "שָׁלוֹם" is composed of two chars, the letter and the mark below, then maybe we could change the internal representation to be:

Char | {Char, Char} | String

Going with that idea, we could also allow up to three or four chars:

Char | {Char, Char} | {Char, Char, Char} | {Char, Char, Char, Char} | String

I know this doesn't cover all cases, but maybe it covers most of them?

That union is not very nice, so maybe another idea could be to have it be like this:

{size: Int32, data: Char[4]} | String

That is, either a Char buffer with a size, or a string. That should simplify the internal implementation, and it would also allow us to change that 4 to 3 or 2 later on very easily.

straight-shoota · 2021-11-24T13:15:31Z

Yeah, I wondered about that as well. But I don't think there is a technical limit on how many codepoints can form a grapheme. You can stack many modifiers.

Maybe we could identify a size that covers a good portion of practically relevant grapheme clusters that would frequently come across. But it's hard to tell.

beta-ziliani

Minor comments (WIP)

src/string/grapheme.cr

src/string/grapheme/grapheme.cr

beta-ziliani · 2021-12-01T19:20:57Z

src/string/grapheme/grapheme.cr

+    # which is accounted for in the state argument.
+    #
+    # The implementation is inspired by https://github.com/JuliaStrings/utf8proc/blob/462093b3924c7491defc67fda4bc7a27baf9b088/utf8proc.c#L291
+    def self.break?(lbc : Property, tbc : Property, state : Pointer(Property)) : Bool


Not a big deal, but would it be possible to have a more functional implementation of this function (that is, avoid the Pointer here)?

I guess so. The return value could be a Tuple(Bool, Property). Not sure if that would be significantly better.

src/string/grapheme/grapheme.cr

Co-authored-by: Beta Ziliani <beta@manas.tech>

src/string/grapheme.cr

src/string/grapheme/grapheme.cr

Co-authored-by: Beta Ziliani <beta@manas.tech>

beta-ziliani

To me this looks good. @naqvis, since you're the original creator, would you mind taking a look at @straight-shoota's changes to make sure we're on the same page? Thanks a lot for your effort!
(Same to you @straight-shoota!)

naqvis

Thanks @straight-shoota for whole refactoring.

Sija · 2021-12-15T11:51:22Z

src/string/grapheme.cr

+  # Returns the number of Unicode extended graphemes clusters in this string.
+  #
+  # * `#each_grapheme` iterates the grapheme clusters.
+  def grapheme_size : Int32


@straight-shoota Wouldn't grapheme_count be a more apt name here, wdyt?

Nope. We always use size.

naqvis and others added 13 commits November 17, 2021 19:18

Add Unicode Text Segmentation Annex#29 to determine the grapheme clus…

8fcd2ea

…ter boundaries of unicode text

Updated API to be consistent with String API

8450fab

Updated API to be consistent with String API - Fixed Specs

b5217cc

Updated API with that in Option 1

551a434

Update to return opaque type

aa6f1d9

changed internal rep to union, moved Iterator to seprate class

b73ac28

changed internal rep to union, moved Iterator to seprate class

3fc5da8

Made misc style and comments changes

295eb4d

Exclude formatter from helper script

d4a881a

Add script to autogenerate Graheme Break Test specs

dfd6d7e

Refactor grapheme break algorithm

de6feb5

Add Grapheme utility methods and specs

029907b

Improve docs

926c550

straight-shoota added kind:feature topic:stdlib:text labels Nov 17, 2021

Blacksmoke16 reviewed Nov 17, 2021

View reviewed changes

asterite reviewed Nov 17, 2021

View reviewed changes

src/string/grapheme.cr Show resolved Hide resolved

Use #each_grapheme_boundary in String#reverse

4156374

beta-ziliani reviewed Dec 1, 2021

View reviewed changes

straight-shoota and others added 3 commits December 1, 2021 22:13

Update src/string/grapheme/grapheme.cr

d4630d0

Co-authored-by: Beta Ziliani <beta@manas.tech>

Improve doc

d5f8c0d

Refactor pointer to tuple return value

b9517c3

beta-ziliani reviewed Dec 2, 2021

View reviewed changes

src/string/grapheme.cr Show resolved Hide resolved

src/string/grapheme/grapheme.cr Outdated Show resolved Hide resolved

Update src/string/grapheme/grapheme.cr

4f3247f

Co-authored-by: Beta Ziliani <beta@manas.tech>

beta-ziliani approved these changes Dec 10, 2021

View reviewed changes

naqvis approved these changes Dec 12, 2021

View reviewed changes

straight-shoota added this to the 1.3.0 milestone Dec 13, 2021

straight-shoota merged commit 4365eed into crystal-lang:master Dec 15, 2021

straight-shoota deleted the feature/graphemes branch December 15, 2021 10:20

Sija reviewed Dec 15, 2021

View reviewed changes

This was referenced Dec 17, 2021

Revert "Formatter: Escape non-printable characters in literals" #11603

Merged

Discussion about Grapheme API #11610

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Unicode grapheme clusters #11472

Implement Unicode grapheme clusters #11472

straight-shoota commented Nov 17, 2021

Blacksmoke16 Nov 17, 2021

straight-shoota Nov 17, 2021

Blacksmoke16 Nov 17, 2021

asterite commented Nov 17, 2021

straight-shoota commented Nov 23, 2021

asterite commented Nov 24, 2021

straight-shoota commented Nov 24, 2021 •

edited

Loading

beta-ziliani left a comment

beta-ziliani Dec 1, 2021

straight-shoota Dec 1, 2021

beta-ziliani left a comment

naqvis left a comment

Sija Dec 15, 2021

straight-shoota Dec 15, 2021

		@@ -0,0 +1,81 @@
		# This script generates the file src/string/grapheme/properties.cr

Implement Unicode grapheme clusters #11472

Implement Unicode grapheme clusters #11472

Conversation

straight-shoota commented Nov 17, 2021

Blacksmoke16 Nov 17, 2021

Choose a reason for hiding this comment

straight-shoota Nov 17, 2021

Choose a reason for hiding this comment

Blacksmoke16 Nov 17, 2021

Choose a reason for hiding this comment

asterite commented Nov 17, 2021

straight-shoota commented Nov 23, 2021

asterite commented Nov 24, 2021

straight-shoota commented Nov 24, 2021 • edited Loading

beta-ziliani left a comment

Choose a reason for hiding this comment

beta-ziliani Dec 1, 2021

Choose a reason for hiding this comment

straight-shoota Dec 1, 2021

Choose a reason for hiding this comment

beta-ziliani left a comment

Choose a reason for hiding this comment

naqvis left a comment

Choose a reason for hiding this comment

Sija Dec 15, 2021

Choose a reason for hiding this comment

straight-shoota Dec 15, 2021

Choose a reason for hiding this comment

straight-shoota commented Nov 24, 2021 •

edited

Loading