Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement UAX31's quick check algorithm #16383

Closed
wants to merge 1 commit into from

Conversation

rikkimax
Copy link
Contributor

@rikkimax rikkimax commented Apr 14, 2024

PR number 2.

A few changes from the original approach:

  • Still using binary search, but it is now bucketed
  • Not using inverse lists
  • Could not merge the quick check algorithm yes/no/maybe value into the tables

Fibonaccian search unfortunately did not win out in my benchmarks for what I tried and it required a lot of extra ROM for both the standard algorithms.

Two things I want to draw your attention to @WalterBright:

  • In lexer the addition of acquireIdentifierFromString function. I'm going to guess that you won't like this function existing, I don't. But I couldn't find a way to merge the pathways wrt. universal character names + D identifiers. So it had to be split out.
  • In identifier I added idPoolNormalize, there is pseudo code for what that function needs to do once toNFC has been implemented. I don't particularly want to touch string interning itself, so I'd appreciate if you could look at it and give me some feedback about how you would want it to be implemented or do a pass over it to make it possible.

@rikkimax rikkimax requested a review from ibuclaw as a code owner April 14, 2024 15:52
@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#16383"

@rikkimax rikkimax force-pushed the uax31-quickcheck branch 3 times, most recently from b0f7ebd to a44fc2a Compare April 15, 2024 17:56
@rikkimax
Copy link
Contributor Author

GCC was very helpful in erroring out with ImportC test.

 ... compilable/normalize_silent.c  -identifiers-importc=UAX31 -normalization-importc=ignore  -fPIC ()==============================
Test 'compilable/normalize_silent.c' failed. The logged output:
/home/runner/work/dmd/dmd/generated/linux/release/64/dmd -conf= -m64 -Icompilable -identifiers-importc=UAX31 -normalization-importc=ignore  -fPIC  -od/home/runner/work/dmd/dmd/compiler/test/test_results/compilable/c -of/home/runner/work/dmd/dmd/compiler/test/test_results/compilable/c/normalize_silent_0.o  -c compilable/normalize_silent.c 
compilable/normalize_silent.c:10:5: warning: `s\U00000345\U00000315\U00000300e' is not in NFC [-Wnormalized=]
   10 | int s\u0345\u0315\u0300e;
      |     ^

Had to disable silently accept there.

If we had a way of knowing that the CPP is GCC, we'd be able to turn this warning off and therefore turn the warning back on.

@rikkimax rikkimax force-pushed the uax31-quickcheck branch 3 times, most recently from 0a76991 to 039666a Compare April 15, 2024 18:12
@rikkimax
Copy link
Contributor Author

I have no idea why core.exception.AssertError@src/dmd/common/charactertables.d(141): unittest failure is failing. It isn't doing it locally and if I explicitly error it out, it errors out.

Nor is it doing it in my benchmark testsuite which verifies it against the old search algorithm.

@rikkimax rikkimax force-pushed the uax31-quickcheck branch 2 times, most recently from 1efeb4f to f07a676 Compare April 15, 2024 19:10
@rikkimax
Copy link
Contributor Author

Command: /opt/hostedtoolcache/dc/dmd-2.108.0/x64/dmd2/linux/bin64/dmd -c -of/home/runner/work/dmd/dmd/dmd/generated/linux/release/32/common.o -w -de -fPIC -m32 -J/home/runner/work/dmd/dmd/dmd/generated/linux/release/32 -I/home/runner/work/dmd/dmd/dmd/compiler/src -g -color=on src/dmd/common/bitfields.d src/dmd/common/file.d src/dmd/common/int128.d src/dmd/common/blake3.d src/dmd/common/outbuffer.d src/dmd/common/smallbuffer.d src/dmd/common/charactertables.d src/dmd/common/identifiertables.d

-----------------------------------------------------------
src/dmd/common/charactertables.d(138): Error: static assert:  [862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u]

Oh hello there, so the buckets aren't being computed...

Table itself is ok, but not the computation of the bucket offets.

@rikkimax rikkimax force-pushed the uax31-quickcheck branch 12 times, most recently from 0eaae20 to d29f012 Compare April 15, 2024 20:12
@rikkimax
Copy link
Contributor Author

Looks like there is a bug some place regarding signed vs unsigned integer comparisons at CTFE.

Test 'fail_compilation/normalize_error.c' failed. The logged output:
/home/runner/work/dmd/dmd/generated/linux/release/64/dmd -conf= -m64 -Ifail_compilation -verrors=0 -identifiers-importc=UAX31 -normalization-importc=error  -fPIC  -od/home/runner/work/dmd/dmd/compiler/test/test_results/fail_compilation/c -of/home/runner/work/dmd/dmd/compiler/test/test_results/fail_compilation/c/normalize_error_0.o  -c fail_compilation/normalize_error.c 
fail_compilation/normalize_error.c:18:5: warning: `s\U00000345\U00000315\U00000300e' is not in NFC [-Wnormalized=]
   18 | int s\u0345\u0315\u0300e;
      |     ^

Makes sense to be getting that, preprocessor runs before us. Gonna have to disable it as well, not much I can do without knowing which preprocessor it is.

@rikkimax rikkimax force-pushed the uax31-quickcheck branch 3 times, most recently from 6f5972e to aa04be1 Compare April 15, 2024 20:44
@rikkimax
Copy link
Contributor Author

Looks like there is a bug hiding out for the UAX31_Continue table with the bucketing algorithm.

It'll need to be fixed prior to merging. But that is not a problem for today.

auto sNormalized = toNFC(s);
if (sNormalized in stringtable)
{
free(sNormalized);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mem.xfree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exact function doesn't matter it's pseudo code.

What I'm worried about is whether the string interner stringtable is capable of handling the pattern of storing unnormalized as key to normalized value.

Write an identifier one way 100 times, which should result in normalization only the 1 time if it wasn't normalized.

This is something I need feedback on before I start next PR in a few months time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stringtable uses a hash of the bytes in the identifier. It does not attempt normalization. Doing that would slow it way down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that isn't what I am suggesting.

The stringtable is only responsible for storing key to value map. Unlike currently I would need to make the key be a different string to the value.

Turns out this is as simple as adding an extra parameter (key) which gets sent to the hash instead of the value parameter.

The quick check algorithm quickly determines if normalization needs to occur for Unicode, for ASCII its always off.

idPoolNormalize then takes whether it is currently normalized and performs the normalization calling into the stringtable as describe in the pseudo code.

I do believe this is all solved for next PR.

@@ -639,6 +639,26 @@ dmd -cov -unittest myprog.d
`Turns off all array bounds checking, even for safe functions. $(RED Deprecated
(use $(TT $(SWLINK -boundscheck)=off) instead).)`,
),
Option("normalization=<strategy>",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need all these strategies? Doesn't C23 specify behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C23 does not specify. Neither does UAX31, as long as we handle it in some stated way (or pickable) we are good.

We don't need the error case as that is already handled with warning as error. We also don't need to expose deprecation externally.

Next PR would be normalization that would get exposed (currently internal only) and later made the default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pick the simplest way and leave it at that. Why should we do normalization at all? I'm beginning to like Nim's strategy of every non-ASCII code unit is an identifier character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't handle normalization in some form a D identifier may look the same, but won't actually match to a C identifier.

Erroring out basically requires a subset of the normalization process except less well-documented. I haven't implemented the full algorithm, only enough to decrease the cost of normalization to "don't worry about it" levels.

Performing normalization autocorrects, and the pseudo code in idPoolNormalize means you only have to do it once per string representation.

Note: just having a larger table doesn't handle normalization, as long as we accept Unicode it will always be able to be an issue for someone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at Nim's documentation.

Case insensitive comparisons of identifiers: https://nim-lang.org/docs/manual.html#lexical-analysis-identifier-equality
So that would be a "fun" idea to propose.

Anyway, Nim transcompiles to C or C++. We don't.

All their identifiers get verified by the C/C++ compiler, which also means in practice they are defined against UAX31 already with normalization (erroring). Even if their language manual doesn't.

We would not be copying them if we changed our tables. They still have a lot more logic going on than we have had.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we don't do normalization. If a user needs it, he can run a filter over his source files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot recommend this.

You have to implement a check for normalization, and it has to be a complete isNFC check to be C23 compliant, unless you want code to not compile that is required to compile.

From C23:

C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.

To detect if you are in form C requires that you basically perform normalization, except you do not store the result. https://unicode.org/reports/tr15/#Detecting_Normalization_Forms

That includes having all the same tables as you would've had otherwise. Not a great tradeoff. Especially when you consider code formatters should be doing it only on identifiers anyway.


Unless you use non-ASCII characters, this change will not affect you.

Due to bug [24507](https://issues.dlang.org/show_bug.cgi?id=24507) if you use GCC as your preprocessor with any identifier that is not normalized to form C, by default it will produce a warning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just support C23 and leave it at that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are, the problem is GCC is picking different behaviors on failure that we want to override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the rest of the text, normalization form C is the big breaking change that people need to be aware of in this PR.

Which is what C23 is defined against.

D.5 Equivalent Normalized Identifiers
1 UAX #31 requires that implementations describe how identifiers are compared and considered
equivalent.
2 C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if anyone even uses normalized identifiers. If we support various incompatible normalized identifier schemes, then the user will be creating code that won't link with other ImportC code.

I suggest pick the C23 scheme and not do any of these other bug accommodations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does gcc have some flag to pick different schemes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. They can pick between a few, but the important ones are C and KC.

I do not want to offer any other normalization forms than C to a user.
D form is easily done once you've got C (it's a subset) so I'll write the wrapper function for it during the next PR that adds the normalization function itself.
But the KC/KD require extra tables, while I'm not against seeing them added I don't see a reason for dmd to need it.

The compiler switches introduced here are to pick what to do when it is not normalized.
Right now that is to accept it as-is (with varying levels of errorness) and later that would be to normalize it.
Without this, I cannot verify the quick check algorithm as part of the test suite currently.

The goal of the switches is to give a ± 10 release window when people can transition forward and backward as required.
That window can be changed, but I believe that may work well so as to try and get one GDC release on either side of changing defaults to C23.

We may not keep any of these switches long-term.
Before removing them happens ImportC should be defined against C23 and we need to transition everyone over to C23.
Once that happens picking tables becomes unnecessary and can all be cleaned up.

But until then to switch 100% over right now would be a breaking change without a transition backwards hence the need for switches.
Unless of course, you're telling me to do the hard break, at which point I'll do a different PR removing all that infrastructure.

Iain may have opinions on switches, as well as if we support NFKC. So we do need to hear from him.

@rikkimax
Copy link
Contributor Author

Why is 2000 lines being removed?

All the Unicode tables got a reformat to be more compact when it is generated.

Intellij really wasn't happy with the significant line count with only one entry per line.

Pairs now have 4, otherwise it's 8 values per line. Now it even syntax highlights!

@rikkimax rikkimax force-pushed the uax31-quickcheck branch 7 times, most recently from 6c0655b to 521c6b1 Compare April 18, 2024 17:13
@rikkimax
Copy link
Contributor Author

Ignoring coverage upload, CI is green.

However, I've remembered that the table generator must not emit entries that span multiple planes. They must be split.

@rikkimax
Copy link
Contributor Author

Right fixed, keeping all the planes separate, actually made the index table for CCC smaller.

@rikkimax
Copy link
Contributor Author

Adding some more test cases would probably be a good idea for the tables.

Could probably generate the test cases with the help of the table generator, will do that another day, but it is getting closer I think.

@rikkimax
Copy link
Contributor Author

Tests for all tables start/middle/end against their indexes have been added.

Please review @WalterBright, scope should be met for this PR now and I've tried to do everything you wanted so if I haven't I need more clarification.

@rikkimax
Copy link
Contributor Author

Right, we are at an impasse and it is not being resolved for 2.109.

@WalterBright since you don't like my solution I'll let you figure out what you want to implement instead of normalization like I wanted to do.

A bug fix that was included in this has been split out into #16410 it must go into 2.109 otherwise the ImportC character table switch won't work.

@rikkimax rikkimax closed this Apr 25, 2024
@WalterBright
Copy link
Member

@rikkimax I still like the 2000 line reduction in code!

@rikkimax
Copy link
Contributor Author

Unfortunately, that was just a reformatting!

You are welcome to PR anything you do like out of this PR, the branch will stay in case you change your mind as it was complete and passing CI.

@WalterBright
Copy link
Member

Thank you, @rikkimax ! I really do appreciate the time and effort you have put into this. Yes, I would like to keep the branch around in case I'm wrong (although I have yet to be wrong about anything :-) but ya never know! ).

@WalterBright
Copy link
Member

@rikkimax I'd like to see this added as a library function!

@rikkimax
Copy link
Contributor Author

@rikkimax I'd like to see this added as a library function!

Looks like std.uni is missing a isNFC, while interesting, it's pretty low on my interest list right now.

Contrary to what it may appear, my focus upstream is more on foundational stabilization of D than niceties like this function.

Shared library support, identifiers, memory safety + temporal, reference counting, things like that are more important to me atm.

@WalterBright
Copy link
Member

I sure hate to not use all the work you've put into this. It sure would be ideal as a library function.

@rikkimax
Copy link
Contributor Author

It's not a huge loss, other parts of the PR will get merged in eventually.

The improvements to the search algorithm for bucketing will be a good speed up, unfortunately, the increased table sizes gave a bit of a performance degradation.

I'll do it if you don't want to probably for a point release.

For now, I have my own stuff to work on as well as the owner escape analysis DIP, which has come along quite nicely. Already heard positive things about it, and Bruce was able to finally understand what DIP1000 is trying to do! He is quite keen to talk to you about it at next meetup!

@rikkimax
Copy link
Contributor Author

If you really want to see some of my work not go to waste, why not have a look at my member of operator PR?

#16161

This unblocks my design for sum types and solves the zero size struct problem for value type exceptions (aka zero size struct exceptions ala C++ proposal).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants