Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode rules for determining valid Identifier characters are obsolete and need to be replaced #37566

Closed
srutzky opened this issue Jul 30, 2019 · 10 comments

Comments

@srutzky
Copy link

srutzky commented Jul 30, 2019

As per all of the reasoning detailed in the following "C# language specification" issue:

Correctify Identifier definition to conform to Unicode standard in "Lexical structure" #2698

the rules / methods of determining what is a valid Identifier character in the following file:

https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/InternalUtilities/UnicodeCharacterUtilities.cs

are obsolete (have been since Unicode 4.0, back in 2003) and need to be replaced with a static list of valid identifier characters (preferably the XID_Start and XID_Continue lists).

The C# language specification and Roslyn compiler are both very outdated in their approach to implementing the Unicode Standard and thus cannot truly claim to support Unicode. Both can claim to use characters and catagorizations from the Unicode Character Database, but neither can claim to be compliant with the Standard.

The current method / approach, replaced 16 years ago, has two major flaws:

  1. It allows for breaking changes. This is not possible for a system that truly supports Unicode. The Unicode Standard has a stability guarantee that ensures all Identifier characters are backwards compatible, regardless of possible future re-categorization. Some functionality should be sensitive to updating the language to use newer versions of Unicode, but not Identifiers. This is why a static list is used. It might be bulky, but it is at least correct.

  2. It does not account for bugs in the .NET implementation of the categories. Meaning, the code in UnicodeCharacterUtilities.cs uses .NET methods to determine if characters are in a particular catagory. Bugs in that underlying code can potentially mis-categorize characters. For example, I did find (and reported) years ago a flaw in the logic that resulted in mis-categorizing circled letters (the write-up is no longer available, ever since MS Connect was shut-down). Neither the language nor the compiler should subject themselves to such potential mistakes (and in fact, this is part of the reasoning provided by Unicode for using only the provided lists: attempting to derive based on the rules is error prone).

@gafter
Copy link
Member

gafter commented Jul 30, 2019

Although the Unicode standard does have a stability guarantee, Unicode has in fact had breaking changes in the specification for an identifier. We follow the Unicode specification, and therefore we have had breaking changes. They are rare and obscure. The Unicode specification defines identifiers in terms of categories, and the C# specification follows suit. Since the specification for identifiers is in terms of categories, so is the implementation.

If you have specific problems to report with the set of accepted identifiers or character classification, please report them. We cannot address a generalized assertion that there are bugs without knowing specifically what those bugs are.

@CyrusNajmabadi
Copy link
Member

It does not account for bugs in the .NET implementation of the categories. Meaning, the code in UnicodeCharacterUtilities.cs uses .NET methods to determine if characters are in a particular catagory. Bugs in that underlying code can potentially mis-categorize characters. For example, I did find (and reported) years ago a flaw in the logic that resulted in mis-categorizing circled letters (the write-up is no longer available, ever since MS Connect was shut-down). Neither the language nor the compiler should subject themselves to such potential mistakes

It's highly unlikely The Roslyn implementation will roll its own solution here. It's going to use whatever the underlying platform it runs on does. If that underlying platform has bugs, so be it. If anything, we should simply update the specification to make this clear.

(and in fact, this is part of the reasoning provided by Unicode for using only the provided lists: attempting to derive based on the rules is error prone).

This is something you can report to the .net framework over at dotnet/corefx. If they change their impl then Roslyn will pick it up automatically.

@CyrusNajmabadi
Copy link
Member

This is why a static list is used. It might be bulky, but it is at least correct.

The benefit of that 'correctness' doesn't seem worth the cost. The current impl is simple, cheap and defers question Roslyn doesn't care about to the platform (which does). It should be easy to describe what the impl does and make that what the spec says should happen.

The spec can and should also indicate that it's acceptable for a conformant implementation to use underlying platform APIs. And, given that, that there might be bugs elevated from the underlying platform to the impl because of that.

@CyrusNajmabadi
Copy link
Member

@srutzky This seems to be third issue opened about unicode and the C# langauge and roslyn compiler. Might i suggest we take a different approach here. It would be good to first come to an agreement on what the goals of the C# language (and the roslyn impl) are in terms of supporting unicode and what the rules should be around its identifiers. When we can come to a common agreement, then we can determine what sort of work (if any) needs to happen. Right now it appears ot me that you have a strong opinion that C# should be spec'ed and should work a certain way.

However, there hasn't been work to come to a common opinion on what that is. This means the proposals and issues come across as an insistence to change things that others may not thing need changing.

@srutzky
Copy link
Author

srutzky commented Aug 26, 2019

Hello @gafter (and @CyrusNajmabadi , @miloush , @ufcpp , and others). I realize that I came in, never having interacted with this project before, and submitted a few rather lengthy issues. And, not only are they long, they're also on a topic that this community (or this group within the community that is working on / following this particular topic) feels is pretty much under control. I didn't mean to put people on the defensive (especially those that have put in years working on this project) by submitting 3 issues in quick succession that don't exactly align with the common understanding of this topic.

My goal here really is just to help improve the state of Unicode support within C# (and anywhere else I come across). I have spent a lot of time over the past 5 or so years studying / researching Unicode (and collations, encodings, etc). And, the more I learn the more I find that Unicode is greatly misunderstood, so I try to correct inaccuracies whenever I run into them. I have recently updated escape-sequence pages for C#, F#, and JavaScript (all part of this project: Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters) ). That's when I found the C# specification in GitHub. I had noticed years ago that some of the info was outdated, but I had no opportunity to help until I found this repository.

With regards to the statements and suggestions I've made across these 3 issues (and one PR), they are not enhancement requests for things that I think would be great for the language or are in need of. Implementing these will provide little, if any, practical benefit to me. These items are actually bugs in terms of conforming to the Unicode Standard. I am a huge advocate of accuracy, and so I want this specification and its implementations to truly be in compliance with the Unicode Standard, instead of just using a Unicode encoding (UTF-16) and various character properties. The benefit is to the integrity of the C# language, via its specification and implementations.

I know nobody is going to make changes to an area that they don't believe needs changing simply because I suggested them, and given my desire for accuracy, I certainly don't want to recommend changes that are either unnecessary or incorrect. So, I've spent a considerable amount of time looking through old revisions of the Unicode Specification and technical reports, old versions of the Unicode Character Database, and even Unicode list-serve archives. I found official documentation to support all of my claims, including finding the year and/or Unicode version in which a particular policy changed or was enacted. All of my research can be found in Correctify Identifier definition to conform to Unicode standard in "Lexical structure" #2698, which contains links back to the source material for anyone who wants to verify.

I am not trying to be difficult about any of this. Again, making no changes does not negatively impact me. I am simply reporting my findings based on verifiable research. Sure, I might have missed something, or misread something. But, any counter-claims should be backed up with citations of official documentation (current version, or version matching the Unicode version supported by the specification and compiler if there are differences).


Although the Unicode standard does have a stability guarantee, Unicode has in fact had breaking changes in the specification for an identifier. We follow the Unicode specification, and therefore we have had breaking changes. They are rare and obscure. The Unicode specification defines identifiers in terms of categories, and the C# specification follows suit. Since the specification for identifiers is in terms of categories, so is the implementation.

I think I understand better now the source of the confusion. What you are describing is, for the most part, Unicode 3.x, and was phased out between versions 4.0 and 4.1, yet C# 6 is using at least version 6.0 of Unicode:

  • Unicode has in fact had breaking changes in the specification for an identifier.

    No, at least not in the past 14 years. Unicode version 4.0 (2003) added Other_ID_Start, and version 4.1 (2005) added Other_ID_Continue, as mechanisms for guaranteeing backwards compatibility. "Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax (version 4.1)" states (emphasis added):

    2.4 Backward Compatibility

    ...
    The Other_ID_Start and Other_ID_Continue properties are thus designed to ensure that the Unicode identifier specification is backward compatible: Any sequence of characters that qualified as an identifier in some version of Unicode will continue to qualify as an identifier in future versions.

    The only exception that I am aware of is "U+30FB KATAKANA MIDDLE DOT". I have not yet found any specific references to this change with regards to identifiers. But, this character's general category did change from "Pc" to "Po" in Unicode version 4.1 (not version 6.0), which is also when it fell out of the ID_Continue property. However, that recategorization alone cannot be the reason that it's no longer valid due to a) it should be in the Other_ID_Continue set, and b) being "Po" isn't a guarantee of exclusion due to "U+00B7 MIDDLE DOT" being in XID_Continue (in version 4.1, as well as it and "U+0387 GREEK ANO TELEIA" being in both ID_Continue and XID_Continue in newer versions of Unicode).

  • We follow the Unicode specification

    No, this specification and implementation do not conform to the Unicode specification, at least not according to the requirements defined in the Unicode Standard, UAX31. This is what I am trying to help correct.

  • therefore we have had breaking changes

    Outside of "U+30FB KATAKANA MIDDLE DOT", the breaking changes are due to relying upon an obsolete approach to determining valid identifier characters, an approach that does not match any version of the Unicode Standard starting with 4.0.

  • The Unicode specification defines identifiers in terms of categories

    It used to. However, starting in Unicode version 4.0 (2003), the definition includes both categories and "Contributory" properties, and those properties are not available in either the Char or CharUnicodeInfo classes. And, even if those properties were provided by the underlying platform, they added language in Unicode version 5.1 to be quite explicit about not using the definitions to derive any particular list (i.e. not just the identifier lists).

Again, I am just trying to help bring this specification and its compilers into full compliance with the Unicode Standard.

Take care, Solomon..

@srutzky
Copy link
Author

srutzky commented Aug 26, 2019

Hello @CyrusNajmabadi

It's highly unlikely The Roslyn implementation will roll its own solution here. It's going to use whatever the underlying platform it runs on does. If that underlying platform has bugs, so be it. If anything, we should simply update the specification to make this clear.

Sure, the specification can be updated to include the information. But, if this is the direction that the community wants to go in, then I don't see how it would be possible to claim that C# either "supports" or "conforms to" the Unicode Standard. Using the underlying platform not only prevents stability, but it can't incorporate the Contributory properties: Other_ID_Start, Other_ID_Continue, Pattern_White_Space, and Pattern_Syntax.

(and in fact, this is part of the reasoning provided by Unicode for using only the provided lists: attempting to derive based on the rules is error prone).

This is something you can report to the .net framework over at dotnet/corefx. If they change their impl then Roslyn will pick it up automatically.

Ok, fair enough. Thanks for that tip. I will start that adventure next :-).

This is why a static list is used. It might be bulky, but it is at least correct.

The benefit of that 'correctness' doesn't seem worth the cost. The current impl is simple, cheap and defers question Roslyn doesn't care about to the platform (which does). It should be easy to describe what the impl does and make that what the spec says should happen.

The spec can and should also indicate that it's acceptable for a conformant implementation to use underlying platform APIs. And, given that, that there might be bugs elevated from the underlying platform to the impl because of that.

Well, I am almost positive that the "cost" of relying upon the underlying platform is to prevent conformance to the Unicode Standard. I would argue that Roslyn (or any compiler / system claiming to conform to the Unicode Standard) does care about this particular question because it is Roslyn that is making the claim. If Roslyn claims to support Unicode version X, then it should verifiably adhere to the rules and state of the UCD for version X. I believe some leeway is given in the conformance requirements such that an implementation can specify either a clearly defined list of valid code points or invalid code points. This allows for unassigned code points to behave in a predictable manner as they are assigned in future versions of Unicode. But, what is considered "working" for a particular version of an implementation cannot be a moving target, at least not in any but the loosest of interpretations of the requirements (so loose that it's highly likely to be incorrect).

It would be good to first come to an agreement on what the goals of the C# language (and the roslyn impl) are in terms of supporting Unicode and what the rules should be around its identifiers. When we can come to a common agreement, then we can determine what sort of work (if any) needs to happen. Right now it appears to me that you have a strong opinion that C# should be spec'ed and should work a certain way.

However, there hasn't been work to come to a common opinion on what that is. This means the proposals and issues come across as an insistence to change things that others may not think need changing.

Understood, and those are good points. My initial goal was simply to clean up the specification (csharplang issue #2672), which again represents no changes to any implementation / compiler. The identifier-related issue (csharplang issue #2698, which is the basis of this issue) is secondary, but no less real. Yes, I agree that we / the community needs to come to a common understanding of the current state of things as well as what the desired state should be. I'm not sure how to best go about that outside of continuing discussion on csharplang issue #2698, and so would certainly appreciate some help / guidance 😸. I will, again, clarify that while I do clearly have strong opinions 🙃, I am not insisting on anything outside of:

  1. accepting the current state of conformance (i.e. that neither the C# spec nor the implementations conform to the Unicode Standard, at least not beyond version 3.2),
  2. choosing a path that is best for the project as a whole and being honest about it (i.e. either support the Unicode Standard by meeting the conformance requirements or do not claim to be supporting the Unicode Standard), and
  3. supporting any counter-claims with official Unicode documentation (of the target supported version of Unicode)

Thoughts?

Thanks, Solomon...

@CyrusNajmabadi
Copy link
Member

Sure, the specification can be updated to include the information. But, if this is the direction that the community wants to go in, then I don't see how it would be possible to claim that C# either "supports" or "conforms to" the Unicode Standard.

Simple. It can just say: it supports/conforms to the standard to the best of it's ability based on the platform it runs on.

It's not hard for a specification to carve out this sort of room for implementation differences.

@CyrusNajmabadi
Copy link
Member

I would argue that Roslyn (or any compiler / system claiming to conform to the Unicode Standard) does care about this particular question

Sure. But there are differing levels of 'care'. That's why I have asked how many are truly impacted by this, and what the impact of any change here would be. So far it seems so incredibly close to zero to be considered just completely neglible.

So, the amount of 'care' corresponds to that. It's not zero, but it's close.

--

Note, this is how all issues are treated. They have to be evaluated in terms of their impact to the ecosystem as a whole, not how important it is to any one person.

@ufcpp
Copy link
Contributor

ufcpp commented Aug 26, 2019

The impact of Katkana Middle Dot breaking change in Japan was very small as a result. Most people wasn't aware of it because very few people use non-roman-alphabetic characters for identifiers. Especially, non-letter characters tends to be avoided.
I used to ask 200> community members in my tech session "Do you want to use surrogate pairs in your codes or not." The majority is "No". Almost all menbers wasn't interested in Kana・Kanji (CJk Ideographics) identifiers. Rather, some people were disappointed that egyptian hierogriphs can't be used. Off cource, this is a joke. Demand for Katakana Middle Dot is nearly the same as hierogriphs.
If many people were interested, I was going to make a pull-requst to support surrogate pairs. But the reality was that I didn't make it because of small demand. Moreover, netstandard2.0 doesn't have performant API for getting category of surrogate pair characters.

@gafter
Copy link
Member

gafter commented Aug 27, 2019

The compiler follows the language specification, which refers to the Unicode categories. It does not refer to the Unicode specification for an identifier in the way that you say it should. Requests for changes to the specification belong in csharplang. Since you are unable to identify any specific incorrect behavior of the compilers, I am closing this.

@gafter gafter closed this as completed Aug 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants