Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Unicode grapheme clusters #11472

Merged
merged 18 commits into from
Dec 15, 2021

Conversation

straight-shoota
Copy link
Member

This patch adds a representation and query methods for Unicode grapheme clusters as defined in Unicode Standard Annex #29.

The data type String::Grapheme represent a grapheme cluster. Instances can be aquired through String#graphemes and String#each_grapheme. This API is purely informational and allows no modifications.

This PR builds on top #10721 by @naqvis which is based on the shard https://github.com/naqvis/uni_text_seg. The enhancements trim down a more concise API and optimize some algorithms with inspiration from https://github.com/JuliaStrings/utf8proc.

/ref https://forum.crystal-lang.org/t/unicode-text-segmentation-grapheme-clusters/3274
Closes #10721

@@ -0,0 +1,81 @@
# This script generates the file src/string/grapheme/properties.cr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to merge this script with the other generate_unicode_data script so that there's only 1 thing that generates all unicode data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to separate these concerns into individual files. They work with different data. The code is more readable when it's clearly separated.

Instead, I would consider adding a Makefile for generating data. That allows us to easily re-generate data with a single command and manage configuration such as Unicode versions in a single place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Works for me. We also have that Unicode::VERSION constant we could use at least versus a make env var.

@asterite
Copy link
Member

Wow, I was just going to ask today or tomorrow if graphemes were going to be considered for the next release, because I think they are very important and useful. This is great!

And thank you so much @naqvis for the original code! 🙏

@straight-shoota
Copy link
Member Author

It has been some time since I did most of the work in this PR, just finished it up recently. It mostly just implements the suggestions made in its predecessor.

But since then I have thought about the internal representation again.

The Char | String union type is a huge improvement for grapheme clusters consisting of a single character. That should be most common in typical use cases, so it saves a big deal of allocations compared to String (or a collection of Char as in the initial propsal).

However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes.

I'm going to briefly sketch some ideas for performance improvements. Obviously, we can defer such optimizations to a later point when we might already have some real-world applications for reference. However it's helpful to have some ideas in mind while defining the API because that might influence what options we can easily apply later.

  1. We can expose the method String#each_grapheme_boundary as part of the public API. That's trivial and basically cost-free. However, I'd consider it more a low-level API. It can be useful for custom grapheme-based implementations.
  2. Employing a StringPool to avoid repeated allocations of identical grapheme clusters. There is no way to release values from a pool, so it would be a bad idea to use a global pool. But individual pools could be useful for single iterations. This could optionally be configurable and the API could allow specifying a custom pool to be used. This would be pretty easy to implement later. We just need to add StringPool parameters to Grapheme.new and the iteration methods.
    A pool means we still need to allocate strings for every grapheme, but we can safe on duplicate allocations of the same graphemes. It is likely that the same graphemes show up repeatedly in longer texts of the same context, so that should be expected to be benefitial.
  3. We could avoid allocating a string entirely by keeping a reference of the original string and making Grapheme essentially a pointer to a substring. String allocation would only happen in Grapheme#to_s. The big downside of this approach is that it keeps the original string in memory. That's a serious implication when working with large amounts of texts. If the Grapheme instaces are only used for iteration, this is not a problem. But if you start storing Graphemes somewhere, they keep pointers to the original strings alive. An example would be collecting frequencies of graphemes over a collection of texts. Assuming every evaluated texts contains a multi-character grapheme that was not encountered previously, all these strings will be kept in memory. The user would need to be aware of this behaviour and take appropriate action such as converting grapheme instances to self-owned cluster strings.

@asterite
Copy link
Member

Is there a common maximum grapheme cluster length that we could take advantage of? For example if the "שָׁ" in "שָׁלוֹם" is composed of two chars, the letter and the mark below, then maybe we could change the internal representation to be:

Char | {Char, Char} | String

Going with that idea, we could also allow up to three or four chars:

Char | {Char, Char} | {Char, Char, Char} | {Char, Char, Char, Char} | String

I know this doesn't cover all cases, but maybe it covers most of them?

That union is not very nice, so maybe another idea could be to have it be like this:

{size: Int32, data: Char[4]} | String

That is, either a Char buffer with a size, or a string. That should simplify the internal implementation, and it would also allow us to change that 4 to 3 or 2 later on very easily.

@straight-shoota
Copy link
Member Author

straight-shoota commented Nov 24, 2021

Yeah, I wondered about that as well. But I don't think there is a technical limit on how many codepoints can form a grapheme. You can stack many modifiers.

Maybe we could identify a size that covers a good portion of practically relevant grapheme clusters that would frequently come across. But it's hard to tell.

Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments (WIP)

src/string/grapheme.cr Show resolved Hide resolved
src/string/grapheme/grapheme.cr Outdated Show resolved Hide resolved
# which is accounted for in the state argument.
#
# The implementation is inspired by https://github.com/JuliaStrings/utf8proc/blob/462093b3924c7491defc67fda4bc7a27baf9b088/utf8proc.c#L291
def self.break?(lbc : Property, tbc : Property, state : Pointer(Property)) : Bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but would it be possible to have a more functional implementation of this function (that is, avoid the Pointer here)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so. The return value could be a Tuple(Bool, Property). Not sure if that would be significantly better.

src/string/grapheme/grapheme.cr Show resolved Hide resolved
src/string/grapheme.cr Show resolved Hide resolved
src/string/grapheme/grapheme.cr Outdated Show resolved Hide resolved
Co-authored-by: Beta Ziliani <beta@manas.tech>
Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this looks good. @naqvis, since you're the original creator, would you mind taking a look at @straight-shoota's changes to make sure we're on the same page? Thanks a lot for your effort!
(Same to you @straight-shoota!)

Copy link
Contributor

@naqvis naqvis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @straight-shoota for whole refactoring.

@straight-shoota straight-shoota added this to the 1.3.0 milestone Dec 13, 2021
@straight-shoota straight-shoota merged commit 4365eed into crystal-lang:master Dec 15, 2021
@straight-shoota straight-shoota deleted the feature/graphemes branch December 15, 2021 10:20
# Returns the number of Unicode extended graphemes clusters in this string.
#
# * `#each_grapheme` iterates the grapheme clusters.
def grapheme_size : Int32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@straight-shoota Wouldn't grapheme_count be a more apt name here, wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. We always use size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants