Unicode rules for determining valid Identifier characters are obsolete and need to be replaced #37566
Comments
Although the Unicode standard does have a stability guarantee, Unicode has in fact had breaking changes in the specification for an identifier. We follow the Unicode specification, and therefore we have had breaking changes. They are rare and obscure. The Unicode specification defines identifiers in terms of categories, and the C# specification follows suit. Since the specification for identifiers is in terms of categories, so is the implementation. If you have specific problems to report with the set of accepted identifiers or character classification, please report them. We cannot address a generalized assertion that there are bugs without knowing specifically what those bugs are. |
It's highly unlikely the Roslyn implementation will roll its own solution here. It's going to use whatever the underlying platform it runs on does. If that underlying platform has bugs, so be it. If anything, we should simply update the specification to make this clear.
This is something you can report to the .net framework over at dotnet/corefx. If they change their impl then Roslyn will pick it up automatically. |
The benefit of that 'correctness' doesn't seem worth the cost. The current impl is simple, cheap, and defers questions Roslyn doesn't care about to the platform (which does). It should be easy to describe what the impl does and make that what the spec says should happen. The spec can and should also indicate that it's acceptable for a conformant implementation to use underlying platform APIs and, given that, that bugs in the underlying platform may surface in the impl. |
@srutzky This seems to be the third issue opened about Unicode and the C# language and Roslyn compiler. Might I suggest we take a different approach here? It would be good to first come to an agreement on what the goals of the C# language (and the Roslyn impl) are in terms of supporting Unicode and what the rules should be around its identifiers. When we can come to a common agreement, then we can determine what sort of work (if any) needs to happen. Right now it appears to me that you have a strong opinion that C# should be spec'ed and should work a certain way. However, there hasn't been work to come to a common opinion on what that is. This means the proposals and issues come across as an insistence to change things that others may not think need changing. |
Hello @gafter (and @CyrusNajmabadi , @miloush , @ufcpp , and others). I realize that I came in, never having interacted with this project before, and submitted a few rather lengthy issues. And, not only are they long, they're also on a topic that this community (or this group within the community that is working on / following this particular topic) feels is pretty much under control. I didn't mean to put people on the defensive (especially those that have put in years working on this project) by submitting 3 issues in quick succession that don't exactly align with the common understanding of this topic. My goal here really is just to help improve the state of Unicode support within C# (and anywhere else I come across). I have spent a lot of time over the past 5 or so years studying / researching Unicode (and collations, encodings, etc). And, the more I learn the more I find that Unicode is greatly misunderstood, so I try to correct inaccuracies whenever I run into them. I have recently updated escape-sequence pages for C#, F#, and JavaScript (all part of this project: Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters) ). That's when I found the C# specification in GitHub. I had noticed years ago that some of the info was outdated, but I had no opportunity to help until I found this repository. With regards to the statements and suggestions I've made across these 3 issues (and one PR), they are not enhancement requests for things that I think would be great for the language or are in need of. Implementing these will provide little, if any, practical benefit to me. These items are actually bugs in terms of conforming to the Unicode Standard. I am a huge advocate of accuracy, and so I want this specification and its implementations to truly be in compliance with the Unicode Standard, instead of just using a Unicode encoding (UTF-16) and various character properties. 
The benefit is to the integrity of the C# language, via its specification and implementations. I know nobody is going to make changes to an area that they don't believe needs changing simply because I suggested them, and given my desire for accuracy, I certainly don't want to recommend changes that are either unnecessary or incorrect. So, I've spent a considerable amount of time looking through old revisions of the Unicode Specification and technical reports, old versions of the Unicode Character Database, and even Unicode list-serve archives. I found official documentation to support all of my claims, including finding the year and/or Unicode version in which a particular policy changed or was enacted. All of my research can be found in Correctify Identifier definition to conform to Unicode standard in "Lexical structure" #2698, which contains links back to the source material for anyone who wants to verify. I am not trying to be difficult about any of this. Again, making no changes does not negatively impact me. I am simply reporting my findings based on verifiable research. Sure, I might have missed something, or misread something. But, any counter-claims should be backed up with citations of official documentation (current version, or version matching the Unicode version supported by the specification and compiler if there are differences).
I think I understand better now the source of the confusion. What you are describing is, for the most part, Unicode 3.x, and was phased out between versions 4.0 and 4.1, yet C# 6 is using at least version 6.0 of Unicode:
Again, I am just trying to help bring this specification and its compilers into full compliance with the Unicode Standard. Take care, Solomon... |
Hello @CyrusNajmabadi
Sure, the specification can be updated to include the information. But, if this is the direction that the community wants to go in, then I don't see how it would be possible to claim that C# either "supports" or "conforms to" the Unicode Standard. Using the underlying platform not only prevents stability, but it can't incorporate the Contributory properties:
Ok, fair enough. Thanks for that tip. I will start that adventure next :-).
Well, I am almost positive that the "cost" of relying upon the underlying platform is to prevent conformance to the Unicode Standard. I would argue that Roslyn (or any compiler / system claiming to conform to the Unicode Standard) does care about this particular question because it is Roslyn that is making the claim. If Roslyn claims to support Unicode version X, then it should verifiably adhere to the rules and state of the UCD for version X. I believe some leeway is given in the conformance requirements such that an implementation can specify either a clearly defined list of valid code points or invalid code points. This allows for unassigned code points to behave in a predictable manner as they are assigned in future versions of Unicode. But, what is considered "working" for a particular version of an implementation cannot be a moving target, at least not in any but the loosest of interpretations of the requirements (so loose that it's highly likely to be incorrect).
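To make the "clearly defined list of valid code points" idea concrete, a pinned-table lookup might look roughly like the sketch below. It is written in Python purely for brevity and is not how Roslyn works today. The table holds only the first few XID_Start ranges as an illustrative excerpt, and the names `XID_START_EXCERPT` and `in_table` are hypothetical. A real table would be generated from DerivedCoreProperties.txt for one specific Unicode version.

```python
from bisect import bisect_right

# Illustrative excerpt only: the first few XID_Start ranges (inclusive).
# A real table would be generated from DerivedCoreProperties.txt for a
# single pinned Unicode version and shipped with the compiler.
XID_START_EXCERPT = [
    (0x0041, 0x005A),  # A..Z
    (0x0061, 0x007A),  # a..z
    (0x00AA, 0x00AA),  # FEMININE ORDINAL INDICATOR
    (0x00B5, 0x00B5),  # MICRO SIGN
    (0x00BA, 0x00BA),  # MASCULINE ORDINAL INDICATOR
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
]

# Sorted range starts, used to binary-search for the candidate range.
_STARTS = [lo for lo, _ in XID_START_EXCERPT]


def in_table(ch: str) -> bool:
    """Return True if ch falls inside one of the pinned ranges."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= XID_START_EXCERPT[i][1]


if __name__ == "__main__":
    for ch in ["A", "0", "\u00b5", "\u00d7"]:
        print(f"U+{ord(ch):04X}", in_table(ch))
```

Because the answer comes from data fixed at build time, the result for a given compiler version does not change when the host platform updates its own Unicode tables. That is the "not a moving target" property described above.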
Understood, and those are good points. My initial goal was simply to clean up the specification (csharplang issue #2672), which again represents no changes to any implementation / compiler. The identifier-related issue (csharplang issue #2698, which is the basis of this issue) is secondary, but no less real. Yes, I agree that we / the community needs to come to a common understanding of the current state of things as well as what the desired state should be. I'm not sure how to best go about that outside of continuing discussion on csharplang issue #2698, and so would certainly appreciate some help / guidance 😸. I will, again, clarify that while I do clearly have strong opinions 🙃, I am not insisting on anything outside of:
Thoughts? Thanks, Solomon... |
Simple. It can just say: it supports/conforms to the standard to the best of its ability based on the platform it runs on. It's not hard for a specification to carve out this sort of room for implementation differences. |
Sure. But there are differing levels of 'care'. That's why I have asked how many are truly impacted by this, and what the impact of any change here would be. So far it seems so incredibly close to zero as to be considered completely negligible. So, the amount of 'care' corresponds to that. It's not zero, but it's close. -- Note, this is how all issues are treated. They have to be evaluated in terms of their impact to the ecosystem as a whole, not how important it is to any one person. |
The impact of the Katakana Middle Dot breaking change in Japan was very small as a result. Most people weren't aware of it because very few people use non-Roman-alphabet characters for identifiers. Non-letter characters especially tend to be avoided. |
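For readers who want to see this kind of recategorization for themselves, here is a minimal sketch. It uses Python rather than C# only because Python's standard `unicodedata` module ships a frozen Unicode 3.2.0 database (`unicodedata.ucd_3_2_0`) next to the current one. The helper name `category_then_and_now` is hypothetical. The sketch just prints what each database reports for U+30FB KATAKANA MIDDLE DOT, and the current value depends on the Python build in use.

```python
import unicodedata


def category_then_and_now(ch: str) -> tuple:
    """Return (category in Unicode 3.2.0, category in the current UCD)."""
    old = unicodedata.ucd_3_2_0.category(ch)
    new = unicodedata.category(ch)
    return old, new


if __name__ == "__main__":
    old, new = category_then_and_now("\u30fb")  # KATAKANA MIDDLE DOT
    print("Unicode", unicodedata.ucd_3_2_0.unidata_version, "->", old)
    print("Unicode", unicodedata.unidata_version, "->", new)
```

A category-based identifier rule gives different answers for a character whenever the two printed categories fall on different sides of the allowed set. That is the mechanism behind a breaking change of this kind.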
The compiler follows the language specification, which refers to the Unicode categories. It does not refer to the Unicode specification for an identifier in the way that you say it should. Requests for changes to the specification belong in |
As per all of the reasoning detailed in the following "C# language specification" issue:
Correctify Identifier definition to conform to Unicode standard in "Lexical structure" #2698
the rules / methods of determining what is a valid Identifier character in the following file:
https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/InternalUtilities/UnicodeCharacterUtilities.cs
are obsolete (have been since Unicode 4.0, back in 2003) and need to be replaced with a static list of valid identifier characters (preferably the XID_Start and XID_Continue lists).
The C# language specification and Roslyn compiler are both very outdated in their approach to implementing the Unicode Standard and thus cannot truly claim to support Unicode. Both can claim to use characters and categorizations from the Unicode Character Database, but neither can claim to be compliant with the Standard.
The current method / approach, replaced 16 years ago, has two major flaws:
It allows for breaking changes. This is not possible for a system that truly supports Unicode. The Unicode Standard has a stability guarantee that ensures all Identifier characters are backwards compatible, regardless of possible future re-categorization. Some functionality should be sensitive to updating the language to use newer versions of Unicode, but not Identifiers. This is why a static list is used. It might be bulky, but it is at least correct.
It does not account for bugs in the .NET implementation of the categories. Meaning, the code in UnicodeCharacterUtilities.cs uses .NET methods to determine if characters are in a particular category. Bugs in that underlying code can potentially mis-categorize characters. For example, years ago I found (and reported) a flaw in the logic that resulted in mis-categorizing circled letters (the write-up is no longer available, ever since MS Connect was shut down). Neither the language nor the compiler should subject themselves to such potential mistakes (and in fact, this is part of the reasoning provided by Unicode for using only the provided lists: attempting to derive based on the rules is error-prone).
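The gap between the two approaches can be sketched in a few lines. The example is in Python rather than C# because Python's `str.isidentifier()` is defined in terms of XID_Start and XID_Continue, which makes it a convenient stand-in for the list-based rule. The category sets below mirror the general categories named in the C# identifier grammar. All function names are hypothetical, and none of this is Roslyn's actual code.

```python
import unicodedata

# General categories the C# grammar allows at the start of an identifier
# (letter-character) and in the rest of it (adds combining, digit,
# connecting, and formatting characters).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
PART_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc", "Cf"}


def category_start(ch: str) -> bool:
    """Category-based rule: underscore or a letter-like category."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES


def category_part(ch: str) -> bool:
    """Category-based rule for the non-initial positions."""
    return unicodedata.category(ch) in PART_CATEGORIES


def py_start(ch: str) -> bool:
    """Python's rule: XID_Start or underscore."""
    return ch.isidentifier()


def py_continue(ch: str) -> bool:
    """Python's rule for non-initial positions: XID_Continue."""
    return ("a" + ch).isidentifier()


if __name__ == "__main__":
    # U+2118 SCRIPT CAPITAL P and U+00B7 MIDDLE DOT are kept as identifier
    # characters through the Other_ID_Start / Other_ID_Continue properties,
    # even though their general categories are not letter-like.
    print("U+2118 start:", category_start("\u2118"), py_start("\u2118"))
    print("U+00B7 part: ", category_part("\u00b7"), py_continue("\u00b7"))
```

The two characters in the demo get their identifier status from the contributory Other_ID_Start and Other_ID_Continue properties. A rule built only on general categories cannot see those properties, so the two approaches disagree on them.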