Implement C23 identifiers via UAX31 (minus normalization) #15307
Conversation
Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.
Bugzilla references: Your PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.
Testing this PR locally: If you don't have a local development environment setup, you can use Digger to test this PR: dub run digger -- build "master + dmd#15307"
(force-pushed from 4b8ffa0 to db60b5d)
Only ocean is failing in CI, and the same error appears in another PR, so it's not something I did. One thing we may want to consider is supporting the C11 and C99 ranges specifically as opt-in options for ImportC.
I've had some more time to think about this. Until I've done this, this PR will break existing code. However, the rest of the code still needs a review, @WalterBright. EDIT: all of the doc.d stuff is wrong as well. However, we may want a combined start/continue set of tables for this.
isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example:
Okay, we'll need a special table for this. Unless I'm told what the character ranges should be (ideally in the form of a UAX31 profile), I'll have to revert to the legacy table (which is a bad thing long term).
Without it, you'd also be permitted to have
Fair enough, I'm convinced. I just need to know what the profile is to implement (as that isn't related to D identifiers). See: https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers
I had a quick peek at the universal character set tables in gcc; they only store the last part of the Unicode character - I assume they must have some other routine that validates the value of the first part. It looks like not all values match what you've auto-generated here? https://github.com/gcc-mirror/gcc/blob/a07dadba85f1b15e270c227dfa70e2fdf331494f/libcpp/ucnid.h#L55
NXX23 will be C23. The second-to-last column will be the ccc (0 is a starter). The last is the character only. The functions in it are for the quick check algorithm. The only way for my tables to be fundamentally wrong is the not/intercept methods on ValueRanges. A lot of it is provided by Unicode.
I've been thinking a bit about a couple of things for this PR.
(force-pushed from d2ce90a to 6e35f85)
@deadalnix has been on the ball with this and has done a very well-optimized table lookup for SDC based upon this PR.
@rikkimax this is good work. I can't help but think, though: what is the problem being solved? It's 3,000 lines; do any of our users need it?
Most of the code is in the tables and is 100% generated. They would be a whole lot smaller if I didn't have to split them up into starters/non-starters + legacy (which would be deprecated regardless). There are also alpha tables that'll be removed once I do the mangling/ddoc identifier implementation (yes, they are different!).

My main concern is not just what we need now but what we'll need in 10-20 years' time (which this would easily cover), or making the ranges for ImportC C99/C11 pickable (not in this PR, but I have a plan if we want that).

But there are some serious problems with our existing implementation, like how the mangling checks are a whitelist that is the same as the D identifiers, so you can't even use a C23 identifier right now if it isn't in our legacy ranges, which may not even be compliant C99!
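As a rough illustration of what those generated tables boil down to, here is a minimal D sketch with hypothetical names and deliberately tiny ranges; the PR's real tables are generated from the Unicode data files and are far larger.

// Hypothetical range type and lookup; illustrative only.
struct ValueRange { dchar start, end; }

// A deliberately tiny "start" table: ASCII letters plus underscore.
immutable ValueRange[] startRanges = [
    ValueRange('A', 'Z'), ValueRange('_', '_'), ValueRange('a', 'z'),
];

// Binary search over sorted, non-overlapping ranges.
bool inRanges(const ValueRange[] table, dchar c)
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        const mid = (lo + hi) / 2;
        if (c < table[mid].start) hi = mid;
        else if (c > table[mid].end) lo = mid + 1;
        else return true;
    }
    return false;
}

unittest
{
    assert(inRanges(startRanges, 'A'));
    // Digits are continue-only under UAX31, so they would live in a separate table.
    assert(!inRanges(startRanges, '1'));
}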
The existing implementation is pretty questionable from what I've seen, which is quite worrying. For example, it's possible to use a non-starter as the first character of an identifier. That may not even render, depending on the text renderer, or if it does it'll be a box, or even worse, merge into punctuation or whitespace. Even ignoring the normalization problems and the failure to track the current standard: between the whitelist mangling and the non-starter start characters for identifiers, it needs work regardless of this PR.
I'm a bit confused. I don't see how identifier characters affect name mangling or the linker at all. The mangled identifier names are prefixed with a count; the contents do not matter at all. Identifiers are in the object file as zero-terminated strings - that shouldn't affect the linker at all, either.
I wish you were right. See dmd/compiler/src/dmd/dmangle.d, line 80 (at 6e35f85).
I would prefer to remove these checks entirely, but Iain wants a set of checks to remain because of assemblers. So my current plan is that control + whitespace + punctuation would be blacklisted.
Check out where it's used. The first is in gdc and ldc may have limits on identifier character sets, but dmd does not.
dmd has a whitelist currently, as you say. I wanted to remove it. @ibuclaw brought the support receipts, so he wants certain things like null terminators to be denied. This one is between you two. My proposal is a blacklist of control characters + whitespace + punctuation; that will be a good solution long term. You two should talk and come to a decision about what I should do about it.
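A rough D sketch of that blacklist proposal, using Phobos' std.uni classification functions as stand-ins for whatever tables the compiler would actually use; the function name here is hypothetical and is not dmd's isValidMangling.

import std.uni : isControl, isPunctuation, isWhite;

// Hypothetical blacklist check: reject only characters known to upset
// assemblers and object-file tools, instead of whitelisting D's identifier ranges.
bool manglingCharAllowed(dchar c)
{
    return c != '\0' && !isControl(c) && !isWhite(c) && !isPunctuation(c);
}

unittest
{
    assert(!manglingCharAllowed('\0')); // null terminator
    assert(!manglingCharAllowed(' '));  // whitespace
    assert(!manglingCharAllowed(','));  // punctuation
    assert(manglingCharAllowed('é'));   // ordinary letters pass
    assert(manglingCharAllowed('ğ'));
}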
OK, given the recent policy adjustment for deprecations, I need to know whether we will be continuing this approach or switching to the quick check algorithm like GCC does, so that only problematic identifiers warn/error instead of perfectly fine ones.
I've been thinking about the cost of actually doing the quick check algorithm, and I know how to do it at the cost of only one extra table; when the check occurs, it could be incredibly cheap!

First, Unicode characters only use 29 bits of a dchar, which leaves 3 we can use. The bottom-most bit can be used to represent the yes/no/maybe value. At runtime this should only require a single left shift at the start of the search function. Next we need the ccc value for a given character; since we only store what is in our identifier ranges, this should hopefully allow a lot of savings in ROM space with a multi-layer trie.

There would need to be two sets of logic. The first happens in the lookup function for start/continue: a simple comparison of the ccc value against the last one, and the yes/no/maybe value being set with an or. In the caller, there would be two additional variables (passed by ref): whether the identifier is not normalized, and the last ccc value. From there, the CLI argument to pick silent-accept/warn/error behavior could take effect.

What this means is that we don't have to get rid of the non-starters and risk breaking code. It's entirely possible to keep them without slowing things down too much! For example, a single-character ASCII identifier should only need to check the not-normalized variable; if it is multi-character, it only does the extra logic if the lookup succeeds.

For reference, the Java algorithm, minus the UTF-16-specific stuff:

public static final int NO = 0, YES = 1, MAYBE = -1;

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codePointAt(i);
        short canonicalClass = getCanonicalClass(ch);
        // A combining class lower than the previous one means the string is
        // definitely not in canonical order, hence not normalized.
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}
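And a minimal D sketch of the bit-packing idea described above (the names and entry layout are assumptions for illustration, not the PR's actual tables): the low bit of each packed entry carries the quick-check flag, and a single shift recovers the code point.

// Hypothetical packed-entry layout: code point shifted left by one, with the
// low bit holding the quick-check flag (1 = yes, 0 = maybe). A definite "no"
// is assumed to be represented by the character simply not being in the table.
enum uint QCYes = 1;

uint packEntry(dchar cp, bool qcYes)
{
    return (cast(uint)cp << 1) | (qcYes ? QCYes : 0);
}

dchar entryCodePoint(uint entry)
{
    return cast(dchar)(entry >> 1); // one shift recovers the code point
}

bool entryIsQuickCheckYes(uint entry)
{
    return (entry & QCYes) != 0;
}

unittest
{
    immutable e = packEntry('ğ', true);
    assert(entryCodePoint(e) == 'ğ');
    assert(entryIsQuickCheckYes(e));
}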
(force-pushed from d0710b9 to aed6bf3)
OOM again, even with @rainers's fix. However, I have a theory. The increase in dmd from ~5 MB to 9 MB is nowhere near enough to be triggering OOMs as often as it is. But there is a possibility here that parallel running of the test suite could make it add up. So I've gone ahead and swapped the test runner's default to the same number as CPUs instead of double.
(force-pushed from aed6bf3 to 416874b)
A quick look over the runners: 16 GB of RAM with 4 cores for Windows, 14 GB of RAM and 3 cores on OSX. That is 4 GB and 4.6 GB of RAM per test. That is a lot of RAM to be getting eaten up to cause OOMs. However, what I'm doing is clearly not working.
Note that this seems to be a new CI regression now, happening way more consistently across the board (master pipelines, other PRs), and now concerning a compiler testsuite test (
IIRC, we explicitly specify the parallelism (not relying on defaults) for most CI runs, e.g. dmd/.azure-pipelines/windows.sh, line 145 (at ab45c77).
Yeah. I went and decreased the CPU count manually in the scripts as well. Let me know and I'll rebase (with these CI-related changes removed); it's not worth bothering to do until a fix is found.
(force-pushed from 416874b to 615b07a)
@WalterBright apart from the CI coverage upload failing, CI is green, so waiting on you again.
(force-pushed from 615b07a to bad6c00)
(force-pushed from bad6c00 to 49fdfb1)
(force-pushed from 49fdfb1 to 35435c8)
///
unittest
{
    assert(isAnyContinue('ğ'));
I think you wanted to test isAnyIdentifierCharacter here. isAnyContinue is defined and tested later.
Indeed.
Seeing test failures that I assume are related to this on recent PRs: https://dev.azure.com/dlanguage/dmd/_build/results?buildId=41950&view=logs&jobId=8af2f344-529d-5953-c6e9-e9e8d82c041d&j=8af2f344-529d-5953-c6e9-e9e8d82c041d&t=007a24f2-5f9f-5584-8be8-7de5c1d9b6f5
I don't think it is my fault. The failure is in writing the object file, which I haven't touched. Which is a very interesting error for us to be getting.
My guess is that the object code generation was never tested on symbols with weird unicode stuff before? It seems like only the 32-bit Windows stuff is broken, I think for obj files (so probably something in backend/dobj.d). I'll try to take a look when I get a chance, but I'm not sure what I'd be looking at...
I think that you may be thinking about the wrong place. I'm in bed so I can't dive into it right now, but I would look at file writing, not generation. Yes, there are problems with Unicode in symbols on Windows, but those show up with linking, which this isn't doing. If there was a problem with generation, I would have expected a problem with the UAX31 ranges; what is failing is C99, which is what we were already defined against.
Actually, you know what this could be? A "file already exists" error that is failing randomly due to concurrency. It is because I named the tests for both C and D the same name. Give them a language prefix and it should resolve it. The docs need this gotcha added too.
$(LI $(I UAX31): UAX31)
$(LI $(I c99): C99)
$(LI $(I c11): C11)
$(LI $(I all): All, the least restrictive set, which comes all others (default))
What does 'comes all others' mean? I'm no native speaker, but it doesn't make sense to me.
Looks like I missed out a word "with".
Okay, the situation currently is significantly worse than even I had known. All we are doing is alpha checks, with no differentiation between start/continue. We were NOT C99-compliant. Also, the legacy start ranges turn out to be quite big (439).
This work was already discussed with @WalterBright, and this PR does not make us C23-compliant, as per Walter's agreement on the approach. I have added deprecation comments for the logic that will go away eventually, as per that agreement (I set them to 2.110 and 2.120, but those are pretty arbitrary choices, so if there are better ones please say).
The approach that was agreed upon is to remove the non-starter characters; in doing so you enforce normalization form C. Alternatively, we could implement the quick check algorithm (or even normalization) and acquire C23 compliance. However, both the deprecation/removal and the implementation of normalization can follow in another PR. This one does not break any code; it only adds ranges.
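As a minimal illustration of what enforcing normalization form C means here (the character used is just an example, not taken from the PR): the precomposed and decomposed spellings of the same visible character are different code-point sequences, and only the decomposed one contains a non-starter (a character with a non-zero ccc).

unittest
{
    // NFC: the precomposed character U+00E9 ("é").
    dstring nfc = "\u00E9";
    // NFD: 'e' followed by the combining acute accent U+0301, a non-starter.
    dstring nfd = "e\u0301";

    // The two render identically, but as identifiers they would not be the
    // same token unless the compiler normalizes, or rejects one of the forms.
    assert(nfc != nfd);
    assert(nfc.length == 1 && nfd.length == 2);
}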
Changelog/spec PR will come after feedback from Walter.
EDIT: This PR's approach has since changed to match the Unicode Annex 31 standard, as specified by the C23 standard. It is a subset thereof which only includes the tables; normalization is not handled here. That needs follow-up PRs.
TLDR: This PR changes D's identifiers to be foundationally more stable, and to match other compilers, by using the Unicode Annex 31 standard. It also makes ImportC more compliant with the C11 standard, while offering more configurability, if needed, in picking the identifier tables.