[DO NOT MERGE] Grapheme emoji support in std.uni #8665

rikkimax · 2023-01-12T23:09:09Z

Original PR from @dukc #8657

Right now I'm unsure if these changes actually do what they say they do as I haven't studied graphemes yet.

But this should now make the CI go green.

dlang-bot · 2023-01-12T23:09:14Z

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close	Bugzilla	Severity	Description
✓	23474	normal	Grapheme should end after carriage return if not followed by line feed.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#8665"

rikkimax · 2023-01-12T23:46:59Z

I've had a read of the grapheme break algorithm. I think the grapheme stride function could do with a minor rewrite that documents which rules are being applied.

The generator work is done, but yeah, it's failing; there is still work left to do.

dukc · 2023-01-13T20:58:13Z

tools/unicode_table_generator.d

@@ -921,6 +927,13 @@ void writeGraphemeTries(File sink)
    writeBest3Level(sink, "hangulLVT", hangul.table["LVT"]);
    writeBest3Level(sink, "mc", props["Mc"]);
    writeBest3Level(sink, "graphemeExtend", props["Grapheme_Extend"]);
+
+    // emoji related data


Inaccurate comment. Prepend and control tables aren't (mostly) related to emojis; only the extended pictographic table is.

Adjust as required; this PR exists to unblock you :)

dukc · 2023-01-13T21:04:51Z

This PR is still using the existing definitions of SpacingMark and Extend. They are incorrect: if you look at the standard, Extend for the purposes of grapheme walking does not consist only of Grapheme_Extend=true from the props file (it also includes all extended pictograms). SpacingMark is even more complicated: it has over a dozen special cases. That's why my PR used the definition from the auxiliary file.

EDIT: because these require the generator to be run as I wrote it, please rerun it without the generator changes made here so I can use the generated file.
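To make the point about the auxiliary file concrete, here is a minimal sketch of reading break classes straight from it. It assumes the file in question is GraphemeBreakProperty.txt and that its lines follow the usual UCD layout (`range ; Property # comment`). The sample lines are an abbreviated excerpt in that layout, and `parse_break_properties` is a hypothetical helper, not something from the generator in this PR. It is written in Python only to keep it independent of any Phobos version.

```python
# Abbreviated sample in the layout of GraphemeBreakProperty.txt (assumed file).
SAMPLE = """\
000D          ; CR # Cc       <control-000D>
0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
1F3FB..1F3FF  ; Extend # Sk   [5] EMOJI MODIFIER FITZPATRICK TYPE-1-2..EMOJI MODIFIER FITZPATRICK TYPE-6
0903          ; SpacingMark # Mc       DEVANAGARI SIGN VISARGA
"""


def parse_break_properties(text):
    """Map each break class name to a list of inclusive (first, last) ranges."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        code_range, prop = (field.strip() for field in data.split(";"))
        first, _, last = code_range.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point: one-element range
        table.setdefault(prop, []).append((lo, hi))
    return table


table = parse_break_properties(SAMPLE)
```

In this excerpt the Extend class already contains a range (the emoji modifiers, general category Sk) that a table built only from nonspacing marks would miss, which is the kind of mismatch described above.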

rikkimax · 2023-01-13T21:12:44Z

> EDIT: because these require the generator to be run as I wrote it, please rerun it without the generator changes made here so I can use the generated file.

Your changes didn't work. I was trying to reverse-engineer what you actually wanted from it and it seems I missed a couple. Once that is sorted, you should be unblocked.

dukc · 2023-01-13T21:15:16Z

> Your changes didn't work. I was trying to reverse-engineer what you actually wanted from it and it seems I missed a couple. Once that is sorted, you should be unblocked.

Hmm, I wonder why. I did generate 64-bit tables with those to test the changes locally. Maybe I made some mistake with the version control. I'll retry and report whether they still work for me.

rikkimax · 2023-01-13T21:17:51Z

The problem wouldn't show up in the generator.

You renamed some tables (I mentioned this in your PR), without the corresponding changes in std.uni.package.

It's best to emit only the tables actually needed, so emitting all of them was overkill and a waste of binary size.

EDIT: okay, maybe you did change it anyway, so I undid some OK work, but bloat is still a worry with Unicode tables.

dukc · 2023-01-13T21:24:27Z

> It's best to emit only the tables actually needed, so emitting all of them was overkill and a waste of binary size.

They are enum tables so should not be a binary size problem. Maybe a compile time problem though. I can make the generator filter unused tables out if you feel it's worth it. It's also easy to remove those manually from the generated file.

rikkimax · 2023-01-13T21:26:28Z

> They are enum tables so should not be a binary size problem.

If they are not referenced, well, I see your point; it just increases compile times a little.

dukc · 2023-01-13T21:29:58Z

Besides, it's only the smaller tables that are unused. All the large ones need to be used anyway. Though since they are tries the size difference is not that big.
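As a rough illustration of why trie-packed tables stay small, here is a two-level sketch in which identical pages are stored once and shared through an index. This is a simplification: the generator in this PR emits three-level tries, and the names `build_trie`, `lookup` and the 256-entry page size are assumptions made for the example.

```python
PAGE_BITS = 8
PAGE_SIZE = 1 << PAGE_BITS


def build_trie(flags):
    """Split a flat boolean table into pages and store each distinct page once.

    len(flags) is assumed to be a multiple of PAGE_SIZE.
    """
    pages = []       # distinct pages, in order of first appearance
    slot_of = {}     # page contents -> position in `pages`
    index = []       # one entry per PAGE_SIZE block of code points
    for start in range(0, len(flags), PAGE_SIZE):
        page = tuple(flags[start:start + PAGE_SIZE])
        if page not in slot_of:
            slot_of[page] = len(pages)
            pages.append(page)
        index.append(slot_of[page])
    return index, pages


def lookup(trie, cp):
    """Return the flag for code point `cp` using the two-level structure."""
    index, pages = trie
    return pages[index[cp >> PAGE_BITS]][cp & (PAGE_SIZE - 1)]


# Example property: combining diacritical marks U+0300..U+036F within the BMP.
flags = [0x0300 <= cp <= 0x036F for cp in range(0x10000)]
trie = build_trie(flags)
```

Here the 65536 flat entries collapse into a 256-entry index plus two distinct pages, because every page other than the one covering U+0300..U+03FF is all false.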

rikkimax · 2023-01-13T21:36:06Z

We compress them too. I think that's one of the reasons std.uni is so expensive to import (they get decompressed at CTFE).

dukc · 2023-01-13T22:00:31Z

Now looks good. Thanks!

rikkimax · 2023-01-13T22:10:10Z

Assuming this is green now:

Not to sound vain (since I redid a bunch of it) but I'm happy with the table generator.

I'm not happy with the state of the grapheme decoder. It needs to document which rules are being used and where, etc. A general cleanup would be good too. Right now I would actually have to study the algorithm to understand what it is doing, which isn't good for the next person who reads it (e.g. to implement it backwards, which is what triggered me to look into this set of work, as someone wanted it).

Basically, I want to be able to look at it and go, "yup, everything is there, LGTM", which in turn makes it easier for those who can pull too.
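One way the "which rule applies where" documentation could look is sketched below: every branch is labelled with the UAX #29 rule it implements. It is written in Python rather than D, it is not the Phobos decoder, and it covers only GB3, GB4, GB5, GB9, GB11 and GB999 (Hangul, Prepend, SpacingMark and regional indicators are left out). The `classify` function uses a few hand-picked code points as a stand-in for the generated tries, and `grapheme_stride` here counts code points, not code units.

```python
CR, LF, CONTROL, EXTEND, ZWJ, EXT_PICT, OTHER = range(7)

# Hand-picked Extended_Pictographic samples (assumption: not a complete table).
EXT_PICT_SAMPLE = {0x1F44D, 0x1F467, 0x1F468, 0x1F469}


def classify(ch):
    """Toy break-class lookup standing in for the generated property tries."""
    cp = ord(ch)
    if cp == 0x0D:
        return CR
    if cp == 0x0A:
        return LF
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return CONTROL
    if cp == 0x200D:
        return ZWJ
    if 0x0300 <= cp <= 0x036F or 0x1F3FB <= cp <= 0x1F3FF or cp == 0xFE0F:
        return EXTEND
    if cp in EXT_PICT_SAMPLE:
        return EXT_PICT
    return OTHER


def grapheme_stride(text, start=0):
    """Number of code points in the grapheme cluster beginning at `start`."""
    if start >= len(text):
        return 0
    prev = classify(text[start])
    pict_run = prev == EXT_PICT   # text so far ends with ExtPict Extend*
    zwj_after_pict = False        # ... followed by exactly one ZWJ
    i = start + 1
    while i < len(text):
        cur = classify(text[i])
        if prev == CR and cur == LF:
            pass                  # GB3: CR x LF
        elif prev in (CR, LF, CONTROL):
            break                 # GB4: break after controls (issue 23474 case)
        elif cur in (CR, LF, CONTROL):
            break                 # GB5: break before controls
        elif cur in (EXTEND, ZWJ):
            pass                  # GB9: x (Extend | ZWJ)
        elif zwj_after_pict and cur == EXT_PICT:
            pass                  # GB11: ExtPict Extend* ZWJ x ExtPict
        else:
            break                 # GB999: otherwise break everywhere
        # Update the GB11 state for the code point just consumed.
        if cur == EXT_PICT:
            pict_run, zwj_after_pict = True, False
        elif cur == EXTEND:
            zwj_after_pict = False
        elif cur == ZWJ:
            pict_run, zwj_after_pict = False, pict_run
        else:
            pict_run = zwj_after_pict = False
        prev = cur
        i += 1
    return i - start
```

With the rules listed in order like this, a reviewer can check the function against the rule table line by line, and a backwards variant would have an obvious checklist to mirror.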

dukc · 2023-01-13T22:12:43Z

> I'm not happy with the state of the grapheme decoder. It needs to document which rules are being used and where, etc. A general cleanup would be good too. Right now I would actually have to study the algorithm to understand what it is doing, which isn't good for the next person who reads it (e.g. to implement it backwards, which is what triggered me to look into this set of work, as someone wanted it).

Understandable. It was a goto hairball to begin with, and I definitely didn't help that. I have to come up with an improvement.

rikkimax · 2023-01-13T22:18:54Z

Okay it's green, lemmie know once you have pushed the code into your PR's branch (copy the files, I don't need attribution).

I'll close my PR once you have done so (I'll keep my branch just in case you need it).

dukc · 2023-01-13T22:21:04Z

Huh? Why would closing prevent me from copying them :D? No reason to leave that to later.

dukc and others added 2 commits December 31, 2022 18:17

Fix issue 23474 - Fixed many issues in grapheme walker

aa298cc

Updated version Ate Eskola (@dukc)'s emoji handling for graphemes

8f2354e

dlang-bot added the Bug Fix label Jan 12, 2023

rikkimax added 2 commits January 13, 2023 12:11

Forgot to rename a couple of tries

e91ba5e

Whitespace to make it fail!

330e863

dukc reviewed Jan 13, 2023

Missing two tables & some clarrifications

7b902e4

rikkimax closed this Jan 13, 2023
