[DO NOT MERGE] Grapheme emoji support in std.uni #8665
Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.
Testing this PR locally
If you don't have a local development environment set up, you can use Digger to test this PR: dub run digger -- build "master + phobos#8665"
I've had a read of the grapheme break algorithm. I think the grapheme stride function could do with a minor rewrite to document which rules are being applied. The generator work is done, but yeah, it's failing; there is still work left to be done.
tools/unicode_table_generator.d
@@ -921,6 +927,13 @@ void writeGraphemeTries(File sink)
writeBest3Level(sink, "hangulLVT", hangul.table["LVT"]);
writeBest3Level(sink, "mc", props["Mc"]);
writeBest3Level(sink, "graphemeExtend", props["Grapheme_Extend"]);

// emoji related data
Inaccurate comment. Prepend and control tables aren't (mostly) related to emojis, only the extended pictographic table is.
Adjust as required, this PR exists to unblock you :)
This PR is still using the existing definitions of SpacingMark and Extend. They are incorrect: if you look at the standard, Extend for the purposes of grapheme walking does not consist only of Grapheme_Extend=true from the props file (it also includes emoji modifiers). SpacingMark is even more complicated; it has over a dozen special cases. That's why my PR used the definitions from the auxiliary file. EDIT: because these require the generator to be run as I wrote it, please rerun it without the changes to the generator here so I can use the generated file.
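For readers following along, here is a minimal sketch of reading Grapheme_Cluster_Break values directly from text laid out like the auxiliary GraphemeBreakProperty.txt file, rather than deriving them from Grapheme_Extend or Mc. The names BreakRange and parseBreakProperty are hypothetical and are not part of the table generator in this PR; the sample lines are a small illustrative excerpt, and the parser assumes well-formed input.

```d
import std.algorithm.searching : findSplit;
import std.conv : to;
import std.string : lineSplitter, strip;

/// An inclusive range of code points sharing one Grapheme_Cluster_Break value.
/// (Hypothetical helper type, for illustration only.)
struct BreakRange
{
    uint first;
    uint last;
    string property;
}

/// Parses text in the "XXXX..YYYY ; Value # comment" layout used by UCD
/// property files. Blank lines and pure comment lines are skipped.
/// Assumes every remaining line is well formed.
BreakRange[] parseBreakProperty(string text)
{
    BreakRange[] result;
    foreach (rawLine; text.lineSplitter)
    {
        // Drop the trailing "# ..." comment, if any.
        auto line = rawLine.findSplit("#")[0].strip;
        if (line.length == 0)
            continue;

        auto fields = line.findSplit(";");
        auto codePoints = fields[0].strip;
        auto property = fields[2].strip;

        auto bounds = codePoints.findSplit("..");
        uint first = bounds[0].to!uint(16);
        // A single code point has no ".." part, so the range is one wide.
        uint last = bounds[1].length ? bounds[2].to!uint(16) : first;
        result ~= BreakRange(first, last, property);
    }
    return result;
}

void main()
{
    import std.stdio : writefln;

    // Illustrative excerpt in the file's layout, not the full data.
    enum sample =
        "# comment line\n" ~
        "000D          ; CR # Cc\n" ~
        "0300..036F    ; Extend # Mn\n" ~
        "0E33          ; SpacingMark # Lo\n" ~
        "1F3FB..1F3FF  ; Extend # Sk\n";

    foreach (r; parseBreakProperty(sample))
        writefln("%04X..%04X  %s", r.first, r.last, r.property);
}
```

Taking the values from the auxiliary file means the special cases (for example U+0E33 as SpacingMark despite being a letter) come from the data instead of being re-derived in the generator.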
Your changes didn't work. I was trying to reverse-engineer what you actually wanted from it, and it seems I missed a couple. Once that is sorted, you should be unblocked.
Hmm, I wonder why. I did generate 64-bit tables with those to test the changes locally. Maybe I made some mistake with the version control. I'll retry and tell you whether they still work for me.
The problem wouldn't show up in the generator. You renamed some tables (I mentioned this in your PR) without the corresponding changes in std.uni.package. It's best to emit only the tables actually needed, so emitting all of them was a bit of overkill and a waste of binary size. EDIT: okay, maybe you did change it anyway, so I undid some OK work, but bloat is still a worry with Unicode tables.
They are enum tables, so they should not be a binary size problem. Maybe a compile time problem, though. I can make the generator filter unused tables out if you feel it's worth it. Alternatively, it's easy to remove those manually from the generated file.
If they are not referenced, well, I see your point; it just increases compile times a little bit.
Besides, it's only the smaller tables that are unused. All the large ones need to be used anyway. Though since they are tries, the size difference is not that big.
We compress them too, which I think is one of the reasons std.uni is so expensive to import (due to decompression at CTFE).
Now looks good. Thanks!
Assuming this is green now: not to sound vain (since I redid a bunch of it), but I'm happy with the table generator. I'm not happy with the state of the grapheme decoder. It needs to document which rules are being used and where, etc. A general cleanup would be good too. Right now I would actually have to study the algorithm to understand what it is doing, which isn't good for the next person who reads it (e.g. to implement it backwards, which is what triggered me to look into this set of work, as someone wanted it). Basically, I want to be able to look at it and go: yup, everything is there, LGTM. That in turn makes it easier for those who can pull, too.
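As a rough illustration of what "documenting which rules are used and where" could look like, here is a sketch of a pairwise break check with each branch labelled by its UAX #29 rule number. It is not the std.uni decoder: the GCB enum and isBreakBetween are hypothetical names, and the sketch deliberately leaves out the rules that need more than two classes of context (Hangul GB6 to GB8, emoji GB11, regional indicators GB12 and GB13).

```d
/// Simplified Grapheme_Cluster_Break classes.
/// (Hypothetical enum for illustration, not the one used by std.uni.)
enum GCB { other, cr, lf, control, extend, zwj, spacingMark, prepend }

/// Returns true if a grapheme cluster boundary falls between two adjacent
/// code points with the given classes. Rules are checked in UAX #29 order,
/// so an earlier rule wins over a later one.
bool isBreakBetween(GCB before, GCB after)
{
    // GB3: CR × LF
    if (before == GCB.cr && after == GCB.lf)
        return false;
    // GB4: (Control | CR | LF) ÷
    if (before == GCB.control || before == GCB.cr || before == GCB.lf)
        return true;
    // GB5: ÷ (Control | CR | LF)
    if (after == GCB.control || after == GCB.cr || after == GCB.lf)
        return true;
    // GB9: × (Extend | ZWJ)
    if (after == GCB.extend || after == GCB.zwj)
        return false;
    // GB9a: × SpacingMark
    if (after == GCB.spacingMark)
        return false;
    // GB9b: Prepend ×
    if (before == GCB.prepend)
        return false;
    // GB999: Any ÷ Any
    return true;
}

void main()
{
    assert(!isBreakBetween(GCB.cr, GCB.lf));         // GB3
    assert(isBreakBetween(GCB.lf, GCB.extend));      // GB4 wins over GB9
    assert(isBreakBetween(GCB.prepend, GCB.control)); // GB5 wins over GB9b
    assert(!isBreakBetween(GCB.other, GCB.extend));  // GB9
    assert(!isBreakBetween(GCB.prepend, GCB.other)); // GB9b
    assert(isBreakBetween(GCB.other, GCB.other));    // GB999
}
```

With the rule number next to each branch, a reviewer can tick the rules off against the standard, and a backwards walker can reuse the same check with the arguments taken in reverse order of traversal.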
Understandable. It was a
Okay, it's green; let me know once you have pushed the code into your PR's branch (copy the files, I don't need attribution). I'll close my PR once you have done so (I'll keep my branch just in case you need it).
Huh? Why would closing prevent me from copying them :D? No reason to leave that to later. |
Original PR from @dukc #8657
Right now I'm unsure if these changes actually do what they say they do as I haven't studied graphemes yet.
But this should now make the CI go green.