Implement C23 identifiers via UAX31 (minus normalization) #15307

rikkimax · 2023-06-11T04:36:41Z

Okay, the current situation is significantly worse than even I had known.

We only do alpha checks, with no differentiation between start and continue characters. We were NOT C99 compliant.
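To illustrate the start/continue distinction, here is a minimal Python sketch (not the dmd implementation). It relies on `str.isidentifier`, which follows the XID_Start/XID_Continue properties (plus `_` as a start character); the helper names `is_start`, `is_continue`, `is_identifier` and `undifferentiated` are made up for this example.

```python
def is_start(ch: str) -> bool:
    # A single character is a valid Python identifier only if it is a start character.
    return ch.isidentifier()

def is_continue(ch: str) -> bool:
    # Prefixing a known start character turns the test into a continue check.
    return ("a" + ch).isidentifier()

def is_identifier(s: str) -> bool:
    # Differentiated check: first character must be a start, the rest continues.
    return bool(s) and is_start(s[0]) and all(is_continue(c) for c in s[1:])

def undifferentiated(s: str) -> bool:
    # Same predicate for every position, similar in spirit to a single alpha table.
    return bool(s) and all(is_continue(c) for c in s)
```

With a single predicate, a string that begins with a combining mark such as U+0301 is accepted; the differentiated check rejects it while still allowing the mark after a base letter.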

Also the legacy start ranges turn out to be quite big (439).

This work was already discussed with @WalterBright, and this PR does not make us C23 compliant as per Walter's agreement on approach. I have added deprecation comments for the logic that will go away eventually as per Walter's agreement for the approach (I set it to 2.110 and 2.120 but those are pretty random in choice, so if there are better ones please say).

The approach that was agreed upon is to remove the non-starter characters. In doing so you enforce normalization form C. Alternatively we could implement the quick check algorithm (or even normalization) and acquire C23 compliance. However both deprecation/removal and implementation of normalization can follow in another PR. This one does not break any code, only adds ranges.

Changelog/spec PR will come after feedback from Walter.

EDIT: This PR's approach has since changed to match the Unicode Annex 31 standard as specified by the C23 standard. It is a subset thereof which only includes the tables. Normalization is not handled here. That needs follow-up PRs.

TLDR: This PR changes D's identifiers to be foundationally more stable and match other compilers by using the Unicode Annex 31 standard. It also makes ImportC more compliant with the C11 standard while offering more configurability, if needed, in picking the identifier tables.

dlang-bot · 2023-06-11T04:36:44Z

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#15307"

rikkimax · 2023-06-11T06:47:26Z

Only ocean is failing in CI, with the same error as in another PR, so it's not something I did.

One thing we may want to consider is supporting C11 and C99 ranges specifically as opt-in options for ImportC.

rikkimax · 2023-06-13T06:37:41Z

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

Until I've done this, this PR will break existing code. However, the rest of the code still needs a review @WalterBright.

EDIT: all of the doc.d stuff is wrong as well. However we may want a combined start/continue set of tables for this.

ibuclaw · 2023-06-13T07:55:28Z

> I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example ! $ - ; ? # | (all those characters are not valid C symbols either).

rikkimax · 2023-06-13T08:03:10Z

> > I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).
>
> isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example ! $ - ; ? # | (all those characters are not valid C symbols either).

Okay, we'll need a special table for this. Unless I'm told what the character ranges should be (ideally in the form of a UAX31 profile), I'll have to revert to the legacy table (which is a bad thing long term).

ibuclaw · 2023-06-13T08:26:34Z

Without it, you'd also be permitted to have \0 and other non-printable characters in a symbol name - yes, someone put pragma(mangle, "foo\0bar") in the testsuite and I complained about it breaking assemblers. :-)

rikkimax · 2023-06-13T08:33:55Z

> Without it, you'd also be permitted to have \0 and other non-printable characters in a symbol name - yes, someone put pragma(mangle, "foo\0bar") in the testsuite and I complained about it breaking assemblers. :-)

Fair enough, I'm convinced. I just need to know what the profile is to implement (as that isn't related to D identifiers). See: https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers

ibuclaw · 2023-06-13T14:58:50Z

I had a quick peek at the universal character set tables in gcc, they only store the last part of the unicode character - I assume they must have some other routine that validates the value of the first part.

It looks like not all values seem to match what you've auto-generated here?

https://github.com/gcc-mirror/gcc/blob/a07dadba85f1b15e270c227dfa70e2fdf331494f/libcpp/ucnid.h#L55

rikkimax · 2023-06-13T20:34:03Z

NXX23 will be C23. Second to last column will be ccc (0 is starter). Last is character only. The functions in it are for the quick check algorithm. Only way for my tables to be fundamentally wrong is the not/intercept methods on ValueRanges. A lot of it is provided by Unicode.

rikkimax · 2023-06-23T17:56:17Z

I've been thinking a bit about this PR for a couple of things.

Mangling checks should cover the ranges of: whitespace, punctuation, and control. Will need to allow previously legacy characters too.
I think I want to swap to using inversion lists + Fibonacci search. This should be faster, but I'll do it in another PR when I start doing some reading on search algorithms.
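For readers unfamiliar with inversion lists, a rough Python sketch follows. It is only an illustration under assumptions: it uses the standard library's binary search (`bisect`) rather than the Fibonacci search mentioned above, and the function names and the ASCII ranges in the usage note are invented for the example.

```python
from bisect import bisect_right

def build_inversion_list(ranges):
    # ranges: sorted, non-overlapping, inclusive (low, high) code point pairs.
    # Each range contributes two boundaries: where membership starts and where it stops.
    inv = []
    for low, high in ranges:
        inv.append(low)
        inv.append(high + 1)
    return inv

def contains(inv, code_point):
    # An odd number of boundaries at or below the code point means we are inside a range.
    return bisect_right(inv, code_point) % 2 == 1
```

For example, `build_inversion_list([(0x41, 0x5A), (0x61, 0x7A)])` gives `[0x41, 0x5B, 0x61, 0x7B]`, and `contains` is true for `ord("A")` but false for `ord("[")`.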

compiler/src/dmd/common/unicode_tables.d

rikkimax · 2023-06-27T19:48:55Z

@deadalnix has been on the ball for this and done a very good optimized table lookup based upon this PR for SDC.

snazzy-d/sdc@199b363

WalterBright · 2023-07-02T05:29:13Z

@rikkimax this is good work. I can't help but think, though, what is the problem being solved? It's 3000 lines, do any of our users need it?

rikkimax · 2023-07-02T05:37:00Z

Most of the code is in the tables and is 100% generated. They would be a whole lot smaller if I didn't have to split them up into starters/non-starters + legacy (which would be deprecated regardless). There are also alpha tables that'll be removed once I do the mangling/ddoc identifier implementation (yes they are different!).

My main concern is not just what we need now but what we need in 10-20 years' time (which this would easily cover), or making the ranges for ImportC C99/C11 pickable (not in this PR, but I have a plan if we want that). But there are some serious problems with our existing implementation. For example, the mangling checks are a whitelist that is the same as the D identifiers, so you can't even use a C23 identifier right now if it isn't in our legacy ranges, which may not even be C99 compliant!

rikkimax · 2023-07-02T05:45:11Z

The existing implementation is pretty questionable from what I've seen, which is quite worrying. For example, it's possible to use a non-starter as a first character of an identifier. That may not even render depending on the text renderer, or if it does it'll be a box or even worse, merge into punctuation or whitespace.
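As a small illustration of the non-starter problem, here is a Python sketch (not dmd code) that uses `unicodedata.combining`, which returns a character's canonical combining class (ccc); a non-zero ccc marks a non-starter. The function names are hypothetical.

```python
import unicodedata

def non_starter_positions(ident: str) -> list:
    # Indices of characters whose canonical combining class is non-zero.
    return [i for i, ch in enumerate(ident) if unicodedata.combining(ch) != 0]

def starts_with_non_starter(ident: str) -> bool:
    # True when the very first character is a combining mark with ccc != 0.
    return bool(ident) and unicodedata.combining(ident[0]) != 0
```

A string such as `"\u0301abc"` starts with a combining acute accent that has no base character to attach to, which is the rendering hazard described above.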

Ignoring the normalization problems and not tracking the current standard; between the white list mangling, and non-starter start character for identifier, it needs work regardless of this PR.

WalterBright · 2023-07-02T19:39:11Z

I'm a bit confused. I don't see how identifier characters affect name mangling or the linker at all. The mangled identifier names are prefixed with a count, the contents do not matter at all. Identifiers are in the object file as zero-terminated strings - that shouldn't affect the linker at all, either.

rikkimax · 2023-07-02T19:41:59Z

I wish you were right.

dmd/compiler/src/dmd/dmangle.d

Line 80 in 6e35f85

isUniAlpha(c);

I would prefer to remove these checks entirely. But Iain wants a set of checks to remain because of assemblers. So my current plan is to blacklist control + whitespace + punctuation.

WalterBright · 2023-07-03T02:47:33Z

Check out where it's used. The first is in pragma(mangle, "string") doing a sanity check on the string, the second is in doGNUABITagSemantic() which I think is to check C++ name mangling. Neither of those need to limit the character set of mangled names or names put in the object file.

gdc and ldc may have limits on identifier character sets, but dmd does not.

rikkimax · 2023-07-03T03:17:48Z

Dmd has a whitelist currently as you say. I wanted to remove them.

@ibuclaw brought the support receipts so he wants certain things like null terminators to be denied.

This one is between you two. My proposal is a blacklist with control characters + whitespace + punctuation, this will be a good solution long term.
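A rough sketch of what such a blacklist could look like, written in Python with `unicodedata.category`. This only illustrates the proposal and is not what dmd does; the function name mirrors `isValidMangling` but is otherwise hypothetical, and exempting `_` is an assumption made here because it is classified as connector punctuation (Pc).

```python
import unicodedata

def is_valid_mangling(name: str) -> bool:
    # Reject empty names and any character in the blacklisted general categories.
    if not name:
        return False
    for ch in name:
        if ch == "_":
            # Assumption: underscore stays legal even though it is category Pc.
            continue
        major = unicodedata.category(ch)[0]
        # C* = control/format/unassigned, Z* = separators (whitespace), P* = punctuation.
        if major in ("C", "Z", "P"):
            return False
    return True
```

One caveat with this category-based sketch: `$` and `|` are symbols (categories Sc and Sm), not punctuation, so it would still accept them even though they were named above as problematic for assemblers.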

You two should talk and come to a decision about what I do about it.

rikkimax · 2023-07-04T19:14:04Z

Ok, given the recent policy adjustment for deprecations, I need to know if we will be continuing this approach or if we are switching to the quick check algorithm like GCC does? So that only problematic identifiers warn/error, instead of perfectly fine ones.

> nothing should be deprecated unless there's a very compelling reason ("this feature was a mistake and people shouldn't be using it" is not a compelling reason)

@WalterBright

rikkimax · 2023-07-18T16:49:34Z

I've been thinking about the cost of actually doing the quick check algorithm and I know how to do it with only the cost of one extra table and if the check occurs, it could be incredibly cheap!

First, Unicode code points only need 21 bits of a 32-bit dchar, which leaves 11 spare bits we can use. The bottom-most bit can be used to represent the yes/no/maybe value. At runtime this should only require a single left shift at the start of the search function.

Next we need the CCC value for a given character, since we only store what is in our identifier ranges this should hopefully allow a lot of savings in ROM space with a multi-layer trie.

There would need to be two sets of logic, the first happens in the lookup function for start/continue. A simple comparison for the ccc value and the last one, and the yes/no/maybe value being set with an or.

In the caller, there would be two additional variables (passed by ref). Is not normalized, and the last ccc value. From there the cli arg to pick silent accept/warn/error behavior could occur.

What this means is we don't have to get rid of the non-starters and risk breaking code. It's entirely possible to keep them without slowing things down too much! For example, a single character ASCII identifier should only need to check the variable for if not normalized. If multi-characters, it only does the logic if it succeeds.
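The bit-packing part of this idea can be sketched in Python as follows. Everything here is an assumption for illustration: the table is a flat sorted list of single code points rather than a multi-layer trie, only one flag bit is modelled, and the names `pack` and `lookup` are invented.

```python
from bisect import bisect_left

MAYBE_BIT = 1  # hypothetical flag stored in the lowest bit of each table entry

def pack(code_point: int, maybe: bool) -> int:
    # Shift the code point up by one bit and store the flag in the freed bit.
    return (code_point << 1) | (MAYBE_BIT if maybe else 0)

def lookup(table, code_point: int):
    # table: sorted list of packed entries.
    # One left shift puts the key into the same space as the packed entries.
    key = code_point << 1
    i = bisect_left(table, key)
    if i < len(table) and table[i] >> 1 == code_point:
        return True, bool(table[i] & MAYBE_BIT)
    return False, False
```

Because the flag sits below the code point, the sort order of the packed entries matches the order of the code points, so the search itself is unchanged and the flag comes back with the hit for free.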

For reference the Java algorithm, minus the UTF-16 specific stuff:

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codePointAt(i);
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

public static final int NO = 0, YES = 1, MAYBE = -1;
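For anyone who wants to experiment with it, here is a rough Python port of the same loop. `unicodedata.combining` supplies the canonical combining class; the standard library does not expose the quick check property, so `is_allowed` is passed in as a callback, and `toy_is_allowed` is a made-up stand-in rather than real Unicode data.

```python
import unicodedata

NO, YES, MAYBE = 0, 1, -1

def quick_check(source: str, is_allowed) -> int:
    last_ccc = 0
    result = YES
    for ch in source:
        ccc = unicodedata.combining(ch)
        # Combining marks must appear in non-decreasing ccc order.
        if last_ccc > ccc and ccc != 0:
            return NO
        check = is_allowed(ch)
        if check == NO:
            return NO
        if check == MAYBE:
            result = MAYBE
        last_ccc = ccc
    return result

def toy_is_allowed(ch: str) -> int:
    # Hypothetical stand-in: treats every combining mark as MAYBE, everything else as YES.
    return MAYBE if unicodedata.combining(ch) else YES
```

With this stand-in, `"a\u0323\u0301"` (dot below, then acute) comes back as MAYBE, while the reversed mark order `"a\u0301\u0323"` comes back as NO because the combining classes are out of order.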

rikkimax · 2024-02-04T15:46:39Z

OOM again even with @rainers's fix.

However, I have a theory. The increase for dmd from ~5 MB to 9 MB is nowhere near enough to be triggering OOMs as often as we are seeing.

But there is a possibility that running the test suite in parallel could make it add up.

So I've gone ahead and changed the test runner's default job count to match the number of CPUs instead of double that.

rikkimax · 2024-02-04T16:24:45Z

A quick look over the runners:

https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

16gb of ram with 4 cores for Windows, 14gb ram and 3 cores on OSX. That is 4gb and 4.6gb of ram per test.

That is a lot of ram to be getting eaten up to cause OOM's.

However, what I'm doing is clearly not working.

kinke · 2024-02-04T16:33:17Z

Note that this seems to be a new CI regression now, happening way more consistently across the board (master pipelines, other PRs), and now concerning a compiler testsuite test (runnable/testprofile.d), not the previous sporadically-failing druntime tests (edit: and not running in parallel with those). Not sure when it started. The test itself seems not to allocate anything really: https://github.com/dlang/dmd/blob/master/compiler/test/runnable/testprofile.d

> So I've gone ahead and changed the test runner's default job count to match the number of CPUs instead of double that.

IIRC, we explicitly specify the parallelism (not relying on defaults) for most CI runs, e.g.,

dmd/.azure-pipelines/windows.sh

Line 145 in ab45c77

./run --environment --jobs=$N "${targets[@]}" "${args[@]}" CC="$CC"

rikkimax · 2024-02-04T16:41:45Z

Yeah.

I went and decreased the cpu count manually in the scripts as well for N.

I'm currently in d_do_test and ran the failing test manually. I haven't been able to work out what is going on, nor find anything eating gigabytes of RAM.

Let me know and I'll rebase (with these CI related changes removed), not worth bothering to do it until a fix is found.

rikkimax · 2024-02-09T15:10:21Z

@WalterBright apart from the CI coverage upload failing, CI is green, so waiting on you again.

etienne02 · 2024-03-19T08:57:12Z

compiler/src/dmd/common/charactertables.d

+///
+unittest
+{
+    assert(isAnyContinue('ğ'));


I think you wanted to test isAnyIdentifierCharacter here.
isAnyContinue is defined and tested later

benjones · 2024-03-22T16:18:19Z

Seeing test failures that I assume are related to this on recent PRs: https://dev.azure.com/dlanguage/dmd/_build/results?buildId=41950&view=logs&jobId=8af2f344-529d-5953-c6e9-e9e8d82c041d&j=8af2f344-529d-5953-c6e9-e9e8d82c041d&t=007a24f2-5f9f-5584-8be8-7de5c1d9b6f5

rikkimax · 2024-03-22T17:02:43Z

I don't think it is my fault. The failure is in writing the object file which I haven't touched. Which is a very interesting error for us to be getting.

benjones · 2024-03-22T17:16:12Z

My guess is that the object code generation was never tested on symbols with weird unicode stuff before? Seems like only 32 bit windows stuff is broken, I think for obj files (so probably something in backend/dobj.d). I'll try to take a look when I get a chance, but am not sure what I'd be looking at...

rikkimax · 2024-03-22T17:24:26Z

I think that you may be thinking about the wrong place. I'm in bed so I can't dive into it right now, but I would look at file writing, not generation. But yes there are problems with Unicode in symbols on windows, but that shows up with linking which this isn't doing. If there was a problem with generation I would have expected a problem with UAX31 ranges, what is failing is c99 which is what we were defined against.

rikkimax · 2024-03-22T17:56:49Z

Actually, ya know what this could be? A "file already exists" error that fails randomly due to concurrency. It is because I gave the C and D tests the same names. Give them a language prefix and that should resolve it. The docs need this gotcha added too.

kinke · 2024-05-25T11:49:11Z

compiler/src/dmd/cli.d

+                    $(LI $(I UAX31): UAX31)
+                    $(LI $(I c99): C99)
+                    $(LI $(I c11): C11)
+                    $(LI $(I all): All, the least restrictive set, which comes all others (default))


What does 'comes all others' mean? I'm no native speaker, but it doesn't make sense to me.

Looks like I missed out a word "with".

rikkimax force-pushed the c23-identifiers branch 6 times, most recently from 4b8ffa0 to db60b5d Compare June 11, 2023 05:17

deadalnix reviewed Jun 26, 2023

compiler/src/dmd/common/unicode_tables.d

rikkimax force-pushed the c23-identifiers branch 2 times, most recently from d2ce90a to 6e35f85 Compare June 26, 2023 15:39

dlang-bot added the stalled label Oct 17, 2023

rikkimax force-pushed the c23-identifiers branch 3 times, most recently from d0710b9 to aed6bf3 Compare February 4, 2024 15:43

rikkimax force-pushed the c23-identifiers branch from aed6bf3 to 416874b Compare February 4, 2024 16:00

rikkimax force-pushed the c23-identifiers branch from 416874b to 615b07a Compare February 5, 2024 12:21

rikkimax marked this pull request as ready for review February 5, 2024 13:27

rikkimax requested a review from ibuclaw as a code owner February 5, 2024 13:27

dlang-bot added the Needs Rebase label Feb 14, 2024

rikkimax force-pushed the c23-identifiers branch from 615b07a to bad6c00 Compare February 24, 2024 07:34

dlang-bot removed the Needs Rebase label Feb 24, 2024

dlang-bot added the Needs Rebase label Mar 12, 2024

rikkimax force-pushed the c23-identifiers branch from bad6c00 to 49fdfb1 Compare March 16, 2024 17:14

dlang-bot removed the Needs Rebase label Mar 16, 2024

Implement UAX31 character ranges

35435c8

rikkimax force-pushed the c23-identifiers branch from 49fdfb1 to 35435c8 Compare March 16, 2024 17:40

RazvanN7 added the Walter Bright label Mar 18, 2024

WalterBright merged commit dffd899 into dlang:master Mar 18, 2024
48 checks passed

etienne02 reviewed Mar 19, 2024

kinke reviewed May 25, 2024

rikkimax deleted the c23-identifiers branch June 3, 2024 13:12
