Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement C23 identifiers via UAX31 (minus normalization) #15307

Merged
merged 1 commit into from Mar 18, 2024

Conversation

rikkimax
Copy link
Contributor

@rikkimax rikkimax commented Jun 11, 2023

Okay the situation currently is significantly worse than even I had known.

We are only doing is alpha checks and no differentiation between start/continue. We were NOT c99 compliant.

Also the legacy start ranges turn out to be quite big (439).

This work was already discussed with @WalterBright, and this PR does not make us C23 compliant as per Walter's agreement on approach. I have added deprecation comments for the logic that will go away eventually as per Walter's agreement for the approach (I set it to 2.110 and 2.120 but those are pretty random in choice, so if there are better ones please say).

The approach that was agreed upon is to remove the non-starter characters. In doing so you enforce normalization form C. Alternatively we could implement the quick check algorithm (or even normalization) and acquire C23 compliance. However both deprecation/removal and implementation of normalization can follow in another PR. This one does not break any code, only adds ranges.

Changelog/spec PR will come after feedback from Walter.

EDIT: This PR's approach has since changed to match the Unicode Annex 31 standard as specified by the C23 standard. It is a subset thereof which only includes the tables. Normalization is not handled here. That needs follow up PR's.

TLDR: This PR changes D's identifiers to be foundationally more stable and match other compilers by using the Unicode Annex 31 standard. It also makes ImportC more compliant with C11 standard while offering more configurability if needed on picking the Identifier tables.

@dlang-bot
Copy link
Contributor

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#15307"

@rikkimax rikkimax force-pushed the c23-identifiers branch 6 times, most recently from 4b8ffa0 to db60b5d Compare June 11, 2023 05:17
@rikkimax
Copy link
Contributor Author

Only ocean failing to CI, same error in another PR so not something I did.

One thing we may want to consider is supporting C11 and C99 ranges specifically as opt-in options for ImportC.

@rikkimax
Copy link
Contributor Author

rikkimax commented Jun 13, 2023

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

Until I've done this, this PR will break existing code. However, the rest of the code still needs a review @WalterBright.

EDIT: all of the doc.d stuff is wrong as well. However we may want a combined start/continue set of tables for this.

@ibuclaw
Copy link
Member

ibuclaw commented Jun 13, 2023

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example ! $ - ; ? # | (all those characters are not valid C symbols either).

@rikkimax
Copy link
Contributor Author

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example ! $ - ; ? # | (all those characters are not valid C symbols either).

Okay, we'll need a special table for this. Unless I'm told what the character ranges should be (ideally in the form of a UAX31 profile), I'll have to revert to the legacy table (which is a bad thing long term).

@ibuclaw
Copy link
Member

ibuclaw commented Jun 13, 2023

Without it, you'd also be permitted to have \0 and other non-printable characters in a symbol name - yes, someone put pragma(mangle, "foo\0bar") in the testsuite and I complained about it breaking assemblers. :-)

@rikkimax
Copy link
Contributor Author

Without it, you'd also be permitted to have \0 and other non-printable characters in a symbol name - yes, someone put pragma(mangle, "foo\0bar") in the testsuite and I complained about it breaking assemblers. :-)

Fair enough, I'm convinced. I just need to know what the profile is to implement (as that isn't related to D identifiers). See: https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers

@ibuclaw
Copy link
Member

ibuclaw commented Jun 13, 2023

I had a quick peek at the universal character set tables in gcc, they only store the last part of the unicode character - I assume they must have some other routine that validates the value of the first part.

It looks like not all values seem to match what you've auto-generated here?

https://github.com/gcc-mirror/gcc/blob/a07dadba85f1b15e270c227dfa70e2fdf331494f/libcpp/ucnid.h#L55

@rikkimax
Copy link
Contributor Author

rikkimax commented Jun 13, 2023 via email

@rikkimax
Copy link
Contributor Author

I've been thinking a bit about this PR for a couple of things.

  1. Mangling checks should cover the ranges of: whitespace, punctuation, and control. Will need to allow previously legacy characters too.
  2. I think I want to swap to using inversion lists + Fibonacci search. This should be faster, but I'll do it in another PR when I start doing some reading on search algorithms.

@rikkimax rikkimax force-pushed the c23-identifiers branch 2 times, most recently from d2ce90a to 6e35f85 Compare June 26, 2023 15:39
@rikkimax
Copy link
Contributor Author

@deadalnix has been on the ball for this and done a very good optimized table lookup based upon this PR for SDC.

snazzy-d/sdc@199b363

@WalterBright
Copy link
Member

@rikkimax this is good work. I can't help but think, though, what is the problem being solved? It's 3000 lines, do any of our users need it?

@rikkimax
Copy link
Contributor Author

rikkimax commented Jul 2, 2023

Most of the code is in the tables and is 100% generated. They would be a whole lot smaller if I didn't have to split them up into starters/non-starters + legacy (which would be deprecated regardless). There is also alpha tables that'll be removed once I do the mangling/ddoc identifier implementation (yes they are different!).

My main concern is not just what we need now but in 10-20 years time (which this would easily cover), or the ranges for ImportC C99/C11 to be pickable (not in PR, but I have a plan if we want that). But there are some serious problems with our existing implementation. Like how the mangling checks are a whitelist that is the same as the D identifiers, so you can't even use a C23 identifier right now if it isn't in our legacy ranges that may not even be compliant C99!

@rikkimax
Copy link
Contributor Author

rikkimax commented Jul 2, 2023

The existing implementation is pretty questionable from what I've seen, which is quite worrying. For example, it's possible to use a non-starter as a first character of an identifier. That may not even render depending on the text renderer, or if it does it'll be a box or even worse, merge into punctuation or whitespace.

Ignoring the normalization problems and not tracking the current standard; between the white list mangling, and non-starter start character for identifier, it needs work regardless of this PR.

@WalterBright
Copy link
Member

I'm a bit confused. I don't see how identifier characters affect name mangling or the linker at all. The mangled identifier names are prefixed with a count, the contents do not matter at all. Identifiers are in the object file as zero-terminated strings - that shouldn't affect the linker at all, either.

@rikkimax
Copy link
Contributor Author

rikkimax commented Jul 2, 2023

I wish you were right.

isUniAlpha(c);

I would prefer to remove these checks entirely. But Iain wants a set of checks to remain cos of assemblers. So control + whitespace + punctuation would be black listed is my current plan.

@WalterBright
Copy link
Member

Check out where it's used. The first is in pragma(mangle, "string") doing a sanity check on the string, the second is in doGNUABITagSemantic() which I think is to check C++ name mangling. Neither of those need to limit the character set of mangled names or names put in the object file.

gdc and ldc may have limits on identifier character sets, but dmd does not.

@rikkimax
Copy link
Contributor Author

rikkimax commented Jul 3, 2023

Dmd has a whitelist currently as you say. I wanted to remove them.

@ibuclaw brought the support receipts so he wants certain things like null terminators to be denied.

This one is between you two. My proposal is a blacklist with control characters + whitespace + punctuation, this will be a good solution long term.

You two should talk and come to a decision about what I do about it.

@rikkimax
Copy link
Contributor Author

rikkimax commented Jul 4, 2023

Ok, given the recent policy adjustment for deprecations, I need to know if we will be continuing this approach or if we are switching to the quick check algorithm like GCC does? So that only problematic identifiers warn/error, instead of perfectly fine ones.

nothing should be deprecated unless there's a very compelling reason ("this feature was a mistake and people shouldn't be using it" is not a compelling reason)

@WalterBright

@rikkimax
Copy link
Contributor Author

I've been thinking about the cost of actually doing the quick check algorithm and I know how to do it with only the cost of one extra table and if the check occurs, it could be incredibly cheap!

First, Unicode characters only use 29 bits of a dchar, which leaves 3 we can use. The bottom-most bit can be used to represent the yes/no/maybe value. At runtime this should only require a single left shift at the start of the search function.

Next we need the CCC value for a given character, since we only store what is in our identifier ranges this should hopefully allow a lot of savings in ROM space with a multi-layer trie.

There would need to be two sets of logic, the first happens in the lookup function for start/continue. A simple comparison for the ccc value and the last one, and the yes/no/maybe value being set with an or.

In the caller, there would be two additional variables (passed by ref). Is not normalized, and the last ccc value. From there the cli arg to pick silent accept/warn/error behavior could occur.

What this means is we don't have to get rid of the non-starters and risk breaking code. It's entirely possible to keep them without slowing things down too much! For example, a single character ASCII identifier should only need to check the variable for if not normalized. If multi-characters, it only does the logic if it succeeds.

For reference the Java algorithm, minus the UTF-16 specific stuff:

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

public static final int NO = 0, YES = 1, MAYBE = -1;

@rikkimax rikkimax force-pushed the c23-identifiers branch 5 times, most recently from d0710b9 to aed6bf3 Compare February 4, 2024 15:43
@rikkimax
Copy link
Contributor Author

rikkimax commented Feb 4, 2024

OOM again even with @rainers's fix.

However, I have a theory. The increase for dmd ~5mb to 9mb is nowhere near enough to be triggering OOM's as much as it is.

But... there is a possibility here, that paralleled running of the test suite, could make it add up.

So I've gone ahead and swapped the default for the test runner to be same number of as cpus instead of double.

@rikkimax
Copy link
Contributor Author

rikkimax commented Feb 4, 2024

A quick look over the runners:

https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

16gb of ram with 4 cores for Windows, 14gb ram and 3 cores on OSX. That is 4gb and 4.6gb of ram per test.

That is a lot of ram to be getting eaten up to cause OOM's.

However, what I'm doing is clearly not working.

@kinke
Copy link
Contributor

kinke commented Feb 4, 2024

Note that this seems to be a new CI regression now, happening way more consistently across the board (master pipelines, other PRs), and now concerning a compiler testsuite test (runnable/testprofile.d), not the previous sporadically-failing druntime tests (edit: and not running in parallel with those). Not sure when it started. The test itself seems not to allocate anything really: https://github.com/dlang/dmd/blob/master/compiler/test/runnable/testprofile.d

So I've gone ahead and swapped the default for the test runner to be same number of as cpus instead of double.

IIRC, we explicitly specify the parallelism (not relying on defaults) for most CI runs, e.g.,

./run --environment --jobs=$N "${targets[@]}" "${args[@]}" CC="$CC"

@rikkimax
Copy link
Contributor Author

rikkimax commented Feb 4, 2024

Yeah.

I went and decreased the cpu count manually in the scripts as well for N.

I'm currently in d_do_test and ran the failing one manually, whatever is going on I haven't been able to find what is going on nor find anything eating gigabytes of ram.

Let me know and I'll rebase (with these CI related changes removed), not worth bothering to do it until a fix is found.

@rikkimax rikkimax marked this pull request as ready for review February 5, 2024 13:27
@rikkimax
Copy link
Contributor Author

rikkimax commented Feb 9, 2024

@WalterBright apart from the CI coverage upload failing, CI is green, so waiting on you again.

@WalterBright WalterBright merged commit dffd899 into dlang:master Mar 18, 2024
48 checks passed
///
unittest
{
assert(isAnyContinue('ğ'));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you wanted to test isAnyIdentifierCharacter here.
isAnyContinue is defined and tested later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

@benjones
Copy link
Contributor

@rikkimax
Copy link
Contributor Author

rikkimax commented Mar 22, 2024 via email

@benjones
Copy link
Contributor

My guess is that the object code generation was never tested on symbols with weird unicode stuff before? Seems like only 32 bit windows stuff is broken, I think for obj files (so probably something in backend/dobj.d). I'll try to take a look when I get a chance, but am not sure what I'd be looking at...

@rikkimax
Copy link
Contributor Author

rikkimax commented Mar 22, 2024 via email

@rikkimax
Copy link
Contributor Author

rikkimax commented Mar 22, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
10 participants