Implement UAX31's quick check algorithm #16383

rikkimax · 2024-04-14T15:52:21Z

PR number 2.

A few changes from the original approach:

Still using binary search, but it is now bucketed
Not using inverse lists
Could not merge the quick check algorithm yes/no/maybe value into the tables

Fibonaccian search unfortunately did not win out in my benchmarks for what I tried and it required a lot of extra ROM for both the standard algorithms.

Two things I want to draw your attention to @WalterBright:

In lexer the addition of acquireIdentifierFromString function. I'm going to guess that you won't like this function existing, I don't. But I couldn't find a way to merge the pathways wrt. universal character names + D identifiers. So it had to be split out.
In identifier I added idPoolNormalize, there is pseudo code for what that function needs to do once toNFC has been implemented. I don't particularly want to touch string interning itself, so I'd appreciate if you could look at it and give me some feedback about how you would want it to be implemented or do a pass over it to make it possible.

dlang-bot · 2024-04-14T15:52:24Z

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#16383"

rikkimax · 2024-04-15T17:58:39Z

GCC was very helpful in erroring out with ImportC test.

 ... compilable/normalize_silent.c  -identifiers-importc=UAX31 -normalization-importc=ignore  -fPIC ()==============================
Test 'compilable/normalize_silent.c' failed. The logged output:
/home/runner/work/dmd/dmd/generated/linux/release/64/dmd -conf= -m64 -Icompilable -identifiers-importc=UAX31 -normalization-importc=ignore  -fPIC  -od/home/runner/work/dmd/dmd/compiler/test/test_results/compilable/c -of/home/runner/work/dmd/dmd/compiler/test/test_results/compilable/c/normalize_silent_0.o  -c compilable/normalize_silent.c 
compilable/normalize_silent.c:10:5: warning: `s\U00000345\U00000315\U00000300e' is not in NFC [-Wnormalized=]
   10 | int s\u0345\u0315\u0300e;
      |     ^

Had to disable silently accept there.

If we had a way of knowing that the CPP is GCC, we'd be able to turn this warning off and therefore turn the warning back on.

rikkimax · 2024-04-15T18:19:06Z

I have no idea why core.exception.AssertError@src/dmd/common/charactertables.d(141): unittest failure is failing. It isn't doing it locally and if I explicitly error it out, it errors out.

Nor is it doing it in my benchmark testsuite which verifies it against the old search algorithm.

rikkimax · 2024-04-15T19:13:12Z

Command: /opt/hostedtoolcache/dc/dmd-2.108.0/x64/dmd2/linux/bin64/dmd -c -of/home/runner/work/dmd/dmd/dmd/generated/linux/release/32/common.o -w -de -fPIC -m32 -J/home/runner/work/dmd/dmd/dmd/generated/linux/release/32 -I/home/runner/work/dmd/dmd/dmd/compiler/src -g -color=on src/dmd/common/bitfields.d src/dmd/common/file.d src/dmd/common/int128.d src/dmd/common/blake3.d src/dmd/common/outbuffer.d src/dmd/common/smallbuffer.d src/dmd/common/charactertables.d src/dmd/common/identifiertables.d

-----------------------------------------------------------
src/dmd/common/charactertables.d(138): Error: static assert:  [862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u, 862u]

Oh hello there, so the buckets aren't being computed...

Table itself is ok, but not the computation of the bucket offets.

rikkimax · 2024-04-15T20:22:05Z

Looks like there is a bug some place regarding signed vs unsigned integer comparisons at CTFE.

Test 'fail_compilation/normalize_error.c' failed. The logged output:
/home/runner/work/dmd/dmd/generated/linux/release/64/dmd -conf= -m64 -Ifail_compilation -verrors=0 -identifiers-importc=UAX31 -normalization-importc=error  -fPIC  -od/home/runner/work/dmd/dmd/compiler/test/test_results/fail_compilation/c -of/home/runner/work/dmd/dmd/compiler/test/test_results/fail_compilation/c/normalize_error_0.o  -c fail_compilation/normalize_error.c 
fail_compilation/normalize_error.c:18:5: warning: `s\U00000345\U00000315\U00000300e' is not in NFC [-Wnormalized=]
   18 | int s\u0345\u0315\u0300e;
      |     ^

Makes sense to be getting that, preprocessor runs before us. Gonna have to disable it as well, not much I can do without knowing which preprocessor it is.

rikkimax · 2024-04-15T21:19:36Z

Looks like there is a bug hiding out for the UAX31_Continue table with the bucketing algorithm.

It'll need to be fixed prior to merging. But that is not a problem for today.

WalterBright · 2024-04-16T05:37:05Z

compiler/src/dmd/identifier.d

+        auto sNormalized = toNFC(s);
+        if (sNormalized in stringtable)
+        {
+            free(sNormalized);


The exact function doesn't matter it's pseudo code.

What I'm worried about is whether the string interner stringtable is capable of handling the pattern of storing unnormalized as key to normalized value.

Write an identifier one way 100 times, which should result in normalization only the 1 time if it wasn't normalized.

This is something I need feedback on before I start next PR in a few months time.

stringtable uses a hash of the bytes in the identifier. It does not attempt normalization. Doing that would slow it way down.

Yes, that isn't what I am suggesting.

The stringtable is only responsible for storing key to value map. Unlike currently I would need to make the key be a different string to the value.

Turns out this is as simple as adding an extra parameter (key) which gets sent to the hash instead of the value parameter.

The quick check algorithm quickly determines if normalization needs to occur for Unicode, for ASCII its always off.

idPoolNormalize then takes whether it is currently normalized and performs the normalization calling into the stringtable as describe in the pseudo code.

I do believe this is all solved for next PR.

compiler/src/dmd/globals.d

WalterBright · 2024-04-16T05:38:39Z

compiler/src/dmd/cli.d

@@ -639,6 +639,26 @@ dmd -cov -unittest myprog.d
            `Turns off all array bounds checking, even for safe functions. $(RED Deprecated
            (use $(TT $(SWLINK -boundscheck)=off) instead).)`,
        ),
+        Option("normalization=<strategy>",


Why do we need all these strategies? Doesn't C23 specify behavior?

C23 does not specify. Neither does UAX31, as long as we handle it in some stated way (or pickable) we are good.

We don't need the error case as that is already handled with warning as error. We also don't need to expose deprecation externally.

Next PR would be normalization that would get exposed (currently internal only) and later made the default.

Pick the simplest way and leave it at that. Why should we do normalization at all? I'm beginning to like Nim's strategy of every non-ASCII code unit is an identifier character.

If we don't handle normalization in some form a D identifier may look the same, but won't actually match to a C identifier.

Erroring out basically requires a subset of the normalization process except less well-documented. I haven't implemented the full algorithm, only enough to decrease the cost of normalization to "don't worry about it" levels.

Performing normalization autocorrects, and the pseudo code in idPoolNormalize means you only have to do it once per string representation.

Note: just having a larger table doesn't handle normalization, as long as we accept Unicode it will always be able to be an issue for someone.

I'm looking at Nim's documentation.

Case insensitive comparisons of identifiers: https://nim-lang.org/docs/manual.html#lexical-analysis-identifier-equality
So that would be a "fun" idea to propose.

Anyway, Nim transcompiles to C or C++. We don't.

All their identifiers get verified by the C/C++ compiler, which also means in practice they are defined against UAX31 already with normalization (erroring). Even if their language manual doesn't.

We would not be copying them if we changed our tables. They still have a lot more logic going on than we have had.

I suggest we don't do normalization. If a user needs it, he can run a filter over his source files.

I cannot recommend this.

You have to implement a check for normalization, and it has to be a complete isNFC check to be C23 compliant, unless you want code to not compile that is required to compile.

From C23:

C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.

To detect if you are in form C requires that you basically perform normalization, except you do not store the result. https://unicode.org/reports/tr15/#Detecting_Normalization_Forms

That includes having all the same tables as you would've had otherwise. Not a great tradeoff. Especially when you consider code formatters should be doing it only on identifiers anyway.

WalterBright · 2024-04-16T05:42:00Z

changelog/dmd.identifier-quick-check.dd

+
+Unless you use non-ASCII characters, this change will not affect you.
+
+Due to bug [24507](https://issues.dlang.org/show_bug.cgi?id=24507) if you use GCC as your preprocessor with any identifier that is not normalized to form C, by default it will produce a warning.


Can't we just support C23 and leave it at that?

We are, the problem is GCC is picking different behaviors on failure that we want to override.

As for the rest of the text, normalization form C is the big breaking change that people need to be aware of in this PR.

Which is what C23 is defined against.

D.5 Equivalent Normalized Identifiers
1 UAX #31 requires that implementations describe how identifiers are compared and considered
equivalent.
2 C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.

I'd be surprised if anyone even uses normalized identifiers. If we support various incompatible normalized identifier schemes, then the user will be creating code that won't link with other ImportC code.

I suggest pick the C23 scheme and not do any of these other bug accommodations.

Does gcc have some flag to pick different schemes?

Yes. They can pick between a few, but the important ones are C and KC.

I do not want to offer any other normalization forms than C to a user.
D form is easily done once you've got C (it's a subset) so I'll write the wrapper function for it during the next PR that adds the normalization function itself.
But the KC/KD require extra tables, while I'm not against seeing them added I don't see a reason for dmd to need it.

The compiler switches introduced here are to pick what to do when it is not normalized.
Right now that is to accept it as-is (with varying levels of errorness) and later that would be to normalize it.
Without this, I cannot verify the quick check algorithm as part of the test suite currently.

The goal of the switches is to give a ± 10 release window when people can transition forward and backward as required.
That window can be changed, but I believe that may work well so as to try and get one GDC release on either side of changing defaults to C23.

We may not keep any of these switches long-term.
Before removing them happens ImportC should be defined against C23 and we need to transition everyone over to C23.
Once that happens picking tables becomes unnecessary and can all be cleaned up.

But until then to switch 100% over right now would be a breaking change without a transition backwards hence the need for switches.
Unless of course, you're telling me to do the hard break, at which point I'll do a different PR removing all that infrastructure.

Iain may have opinions on switches, as well as if we support NFKC. So we do need to hear from him.

rikkimax · 2024-04-16T10:45:32Z

Why is 2000 lines being removed?

All the Unicode tables got a reformat to be more compact when it is generated.

Intellij really wasn't happy with the significant line count with only one entry per line.

Pairs now have 4, otherwise it's 8 values per line. Now it even syntax highlights!

rikkimax · 2024-04-19T02:59:29Z

Ignoring coverage upload, CI is green.

However, I've remembered that the table generator must not emit entries that span multiple planes. They must be split.

rikkimax · 2024-04-19T03:28:06Z

Right fixed, keeping all the planes separate, actually made the index table for CCC smaller.

rikkimax · 2024-04-19T03:31:08Z

Adding some more test cases would probably be a good idea for the tables.

Could probably generate the test cases with the help of the table generator, will do that another day, but it is getting closer I think.

rikkimax · 2024-04-20T00:13:25Z

Tests for all tables start/middle/end against their indexes have been added.

Please review @WalterBright, scope should be met for this PR now and I've tried to do everything you wanted so if I haven't I need more clarification.

rikkimax · 2024-04-25T20:15:08Z

Right, we are at an impasse and it is not being resolved for 2.109.

@WalterBright since you don't like my solution I'll let you figure out what you want to implement instead of normalization like I wanted to do.

A bug fix that was included in this has been split out into #16410 it must go into 2.109 otherwise the ImportC character table switch won't work.

WalterBright · 2024-04-26T01:02:28Z

@rikkimax I still like the 2000 line reduction in code!

rikkimax · 2024-04-26T01:08:23Z

Unfortunately, that was just a reformatting!

You are welcome to PR anything you do like out of this PR, the branch will stay in case you change your mind as it was complete and passing CI.

WalterBright · 2024-04-26T01:34:59Z

Thank you, @rikkimax ! I really do appreciate the time and effort you have put into this. Yes, I would like to keep the branch around in case I'm wrong (although I have yet to be wrong about anything :-) but ya never know! ).

WalterBright · 2024-04-29T20:12:44Z

@rikkimax I'd like to see this added as a library function!

rikkimax · 2024-04-29T20:18:37Z

@rikkimax I'd like to see this added as a library function!

Looks like std.uni is missing a isNFC, while interesting, it's pretty low on my interest list right now.

Contrary to what it may appear, my focus upstream is more on foundational stabilization of D than niceties like this function.

Shared library support, identifiers, memory safety + temporal, reference counting, things like that are more important to me atm.

WalterBright · 2024-04-30T04:11:29Z

I sure hate to not use all the work you've put into this. It sure would be ideal as a library function.

rikkimax · 2024-04-30T04:20:17Z

It's not a huge loss, other parts of the PR will get merged in eventually.

The improvements to the search algorithm for bucketing will be a good speed up, unfortunately, the increased table sizes gave a bit of a performance degradation.

I'll do it if you don't want to probably for a point release.

For now, I have my own stuff to work on as well as the owner escape analysis DIP, which has come along quite nicely. Already heard positive things about it, and Bruce was able to finally understand what DIP1000 is trying to do! He is quite keen to talk to you about it at next meetup!

rikkimax · 2024-04-30T04:44:39Z

If you really want to see some of my work not go to waste, why not have a look at my member of operator PR?

#16161

This unblocks my design for sum types and solves the zero size struct problem for value type exceptions (aka zero size struct exceptions ala C++ proposal).

rikkimax requested a review from ibuclaw as a code owner April 14, 2024 15:52

rikkimax force-pushed the uax31-quickcheck branch 3 times, most recently from b0f7ebd to a44fc2a Compare April 15, 2024 17:56

rikkimax force-pushed the uax31-quickcheck branch 3 times, most recently from 0a76991 to 039666a Compare April 15, 2024 18:12

rikkimax force-pushed the uax31-quickcheck branch 2 times, most recently from 1efeb4f to f07a676 Compare April 15, 2024 19:10

rikkimax force-pushed the uax31-quickcheck branch 12 times, most recently from 0eaae20 to d29f012 Compare April 15, 2024 20:12

rikkimax force-pushed the uax31-quickcheck branch 3 times, most recently from 6f5972e to aa04be1 Compare April 15, 2024 20:44

WalterBright reviewed Apr 16, 2024

View reviewed changes

compiler/src/dmd/globals.d Show resolved Hide resolved

WalterBright reviewed Apr 16, 2024

View reviewed changes

rikkimax force-pushed the uax31-quickcheck branch 7 times, most recently from 6c0655b to 521c6b1 Compare April 18, 2024 17:13

rikkimax force-pushed the uax31-quickcheck branch from 521c6b1 to 58adbd2 Compare April 19, 2024 03:27

rikkimax force-pushed the uax31-quickcheck branch from 58adbd2 to fbc02c1 Compare April 20, 2024 00:11

Implement UAX31's quick check algorithm

a073763

rikkimax force-pushed the uax31-quickcheck branch from fbc02c1 to a073763 Compare April 20, 2024 00:18

rikkimax closed this Apr 25, 2024


		Unless you use non-ASCII characters, this change will not affect you.

		Due to bug [24507](https://issues.dlang.org/show_bug.cgi?id=24507) if you use GCC as your preprocessor with any identifier that is not normalized to form C, by default it will produce a warning.

Implement UAX31's quick check algorithm #16383

Implement UAX31's quick check algorithm #16383

Conversation

rikkimax commented Apr 14, 2024 • edited

dlang-bot commented Apr 14, 2024

Bugzilla references

Testing this PR locally

rikkimax commented Apr 15, 2024

rikkimax commented Apr 15, 2024

rikkimax commented Apr 15, 2024

rikkimax commented Apr 15, 2024

rikkimax commented Apr 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rikkimax commented Apr 16, 2024

rikkimax commented Apr 19, 2024

rikkimax commented Apr 19, 2024

rikkimax commented Apr 19, 2024

rikkimax commented Apr 20, 2024

rikkimax commented Apr 25, 2024

WalterBright commented Apr 26, 2024

rikkimax commented Apr 26, 2024

WalterBright commented Apr 26, 2024

WalterBright commented Apr 29, 2024

rikkimax commented Apr 29, 2024

WalterBright commented Apr 30, 2024

rikkimax commented Apr 30, 2024

rikkimax commented Apr 30, 2024

rikkimax commented Apr 14, 2024 •

edited