Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Grapheme emoji support in std.uni #8665

Closed
wants to merge 5 commits into from

Conversation

rikkimax
Copy link
Contributor

Original PR from @dukc #8657

Right now I'm unsure if these changes actually do what they say they do as I haven't studied graphemes yet.

But this should now make the CI go green.

@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
23474 normal Grapheme should end after carriage return if not followed by line feed.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#8665"

@rikkimax
Copy link
Contributor Author

I've had a read of the grapheme break algorithm, I think the grapheme stride function could do with a bit of a minor rewrite to include what rules are being applied.

The generator stuff is done, but yeah its failing, there is still work left to be done.

@@ -921,6 +927,13 @@ void writeGraphemeTries(File sink)
writeBest3Level(sink, "hangulLVT", hangul.table["LVT"]);
writeBest3Level(sink, "mc", props["Mc"]);
writeBest3Level(sink, "graphemeExtend", props["Grapheme_Extend"]);

// emoji related data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inaccurate comment. Prepend and control tables aren't (mostly) related to emojis, only the extended pictographic table is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjust as required, this PR exists to unblock you :)

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

This PR is still using the existing definitions of SpacingMark and Extend. They are incorrect: if you look at the standard, Extend for intents of grapheme walking does not consist only of Grapheme_Extend=true from the props file (it also includes all extended pictograms). SpacingMark is even more complicated, it has over a dozen of special cases. That's why my PR used the definition from the auxiliary file.

EDIT: because these require the generator file is run as written by me, please rerun it without changes to the generator here so I can use the generated file.

@rikkimax
Copy link
Contributor Author

EDIT: because these require the generator file is run as written by me, please rerun it without changes to the generator here so I can use the generated file.

Your changes didn't work. I was trying to reverse-engineer what you actually wanted from it and it seems I missed a couple, once that is sorted you should be unblocked.

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

Your changes didn't work. I was trying to reverse-engineer what you actually wanted from it and it seems I missed a couple, once that is sorted you should be unblocked.

Hmm, I wonder why. I did generate 64-bit tables with those to test the changes locally. Maybe I made some mistake with the version control. I'll retry and tell if they still work for me.

@rikkimax
Copy link
Contributor Author

rikkimax commented Jan 13, 2023

The problem wouldn't show up in the generator.

You renamed some tables (I mentioned this in your PR), without the corresponding changes in std.uni.package.

It's best to only emit the tables actually needed so emitting all of them was a bit overkill and a waste on the binary size.

EDIT: okay maybe you did change anyway so I undid some ok work, but still bloat is a worry with Unicode tables.

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

It's best to only emit the tables actually needed so emitting all of them was a bit overkill and a waste on the binary size.

They are enum tables so should not be a binary size problem. Maybe a compile time problem though. I can make the generator filter unused tables out if you feel it's worth it. It's easy to respectively remove those manually from the generated file.

@rikkimax
Copy link
Contributor Author

They are enum tables so should not be a binary size problem.

If they are not referenced, well I see your point, it just increases compile times a little bit.

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

Besides, it's only the smaller tables that are unused. All the large ones need to be used anyway. Though since they are tries the size difference is not that big.

@rikkimax
Copy link
Contributor Author

We compress them too. One of the reasons I think std.uni is so expensive to import (due to decompressing at CTFE).

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

Now looks good. Thanks!

@rikkimax
Copy link
Contributor Author

rikkimax commented Jan 13, 2023

Assuming this is green now:

Not to sound vain (since I redid a bunch of it) but I'm happy with the table generator.

I'm not happy with the state of the grapheme decoder. It needs to be documented, which rules are being used and where ext. A general clean up would be good too. Right now I would actually have to study the algorithm to understand what it is doing which isn't good for the next person who reads it (i.e. to implement backwards which is what triggered me to look into this set of work as someone wanted it).

Basically, I want to be able to look at it and go, yup everything is there LGTM. Which in turn makes it easier for those who can pull too.

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

I'm not happy with the state of the grapheme decoder. It needs to be documented, which rules are being used and where ext. A general clean up would be good too. Right now I would actually have to study the algorithm to understand what it is doing which isn't good for the next person who reads it (i.e. to implement backwards which is what triggered me to look into this set of work as someone wanted it).

Understandable. It was a goto hairball to begin with, and I definitely didn't help that. I have to come up with an improvement.

@rikkimax
Copy link
Contributor Author

rikkimax commented Jan 13, 2023

Okay it's green, lemmie know once you have pushed the code into your PR's branch (copy the files, I don't need attribution).

I'll close my PR once you have done so (I'll keep my branch just in case you need it).

@dukc
Copy link
Contributor

dukc commented Jan 13, 2023

Huh? Why would closing prevent me from copying them :D? No reason to leave that to later.

@rikkimax rikkimax closed this Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants