-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Unicode grapheme clusters #11472
Implement Unicode grapheme clusters #11472
Conversation
…ter boundaries of unicode text
@@ -0,0 +1,81 @@ | |||
# This script generates the file src/string/grapheme/properties.cr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to merge this script with the other generate_unicode_data
script so that there's only 1 thing that generates all unicode data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to separate these concerns into individual files. They work with different data. The code is more readable when it's clearly separated.
Instead, I would consider adding a Makefile for generating data. That allows us to easily re-generate data with a single command and manage configuration such as Unicode versions in a single place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Works for me. We also have that Unicode::VERSION
constant we could use at least versus a make env var.
Wow, I was just going to ask today or tomorrow if graphemes were going to be considered for the next release, because I think they are very important and useful. This is great! And thank you so much @naqvis for the original code! 🙏 |
It has been some time since I did most of the work in this PR, just finished it up recently. It mostly just implements the suggestions made in its predecessor. But since then I have thought about the internal representation again. The However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes. I'm going to briefly sketch some ideas for performance improvements. Obviously, we can defer such optimizations to a later point when we might already have some real-world applications for reference. However it's helpful to have some ideas in mind while defining the API because that might influence what options we can easily apply later.
|
Is there a common maximum grapheme cluster length that we could take advantage of? For example if the "שָׁ" in "שָׁלוֹם" is composed of two chars, the letter and the mark below, then maybe we could change the internal representation to be: Char | {Char, Char} | String Going with that idea, we could also allow up to three or four chars: Char | {Char, Char} | {Char, Char, Char} | {Char, Char, Char, Char} | String I know this doesn't cover all cases, but maybe it covers most of them? That union is not very nice, so maybe another idea could be to have it be like this: {size: Int32, data: Char[4]} | String That is, either a Char buffer with a size, or a string. That should simplify the internal implementation, and it would also allow us to change that 4 to 3 or 2 later on very easily. |
Yeah, I wondered about that as well. But I don't think there is a technical limit on how many codepoints can form a grapheme. You can stack many modifiers. Maybe we could identify a size that covers a good portion of practically relevant grapheme clusters that would frequently come across. But it's hard to tell. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments (WIP)
src/string/grapheme/grapheme.cr
Outdated
# which is accounted for in the state argument. | ||
# | ||
# The implementation is inspired by https://github.com/JuliaStrings/utf8proc/blob/462093b3924c7491defc67fda4bc7a27baf9b088/utf8proc.c#L291 | ||
def self.break?(lbc : Property, tbc : Property, state : Pointer(Property)) : Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but would it be possible to have a more functional implementation of this function (that is, avoid the Pointer
here)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess so. The return value could be a Tuple(Bool, Property)
. Not sure if that would be significantly better.
Co-authored-by: Beta Ziliani <beta@manas.tech>
Co-authored-by: Beta Ziliani <beta@manas.tech>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this looks good. @naqvis, since you're the original creator, would you mind taking a look at @straight-shoota's changes to make sure we're on the same page? Thanks a lot for your effort!
(Same to you @straight-shoota!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @straight-shoota for whole refactoring.
# Returns the number of Unicode extended graphemes clusters in this string. | ||
# | ||
# * `#each_grapheme` iterates the grapheme clusters. | ||
def grapheme_size : Int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@straight-shoota Wouldn't grapheme_count
be a more apt name here, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. We always use size.
This patch adds a representation and query methods for Unicode grapheme clusters as defined in Unicode Standard Annex #29.
The data type
String::Grapheme
represent a grapheme cluster. Instances can be aquired throughString#graphemes
andString#each_grapheme
. This API is purely informational and allows no modifications.This PR builds on top #10721 by @naqvis which is based on the shard https://github.com/naqvis/uni_text_seg. The enhancements trim down a more concise API and optimize some algorithms with inspiration from https://github.com/JuliaStrings/utf8proc.
/ref https://forum.crystal-lang.org/t/unicode-text-segmentation-grapheme-clusters/3274
Closes #10721