Correct the Identifier definition to conform to the Unicode standard in "Lexical structure" #305
Comments
Why not simplify all of this and just use what the actual impl does to define the spec rule? i.e. the impl has: https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/InternalUtilities/UnicodeCharacterUtilities.cs#L12 and https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/InternalUtilities/UnicodeCharacterUtilities.cs#L46

Summarizing them, it looks like we have the following: a C# identifier can start with an underscore or any character from the Lu, Ll, Lt, Lm, Lo, or Nl Unicode classes. After the starting character, a C# identifier can be followed by any characters from that above set, or any character that the underlying platform APIs consider: UnicodeCategory.DecimalDigitNumber, UnicodeCategory.ConnectorPunctuation, UnicodeCategory.Format, UnicodeCategory.NonSpacingMark, UnicodeCategory.SpacingCombiningMark, UnicodeCategory.LetterNumber. Would that not be sufficient? In a simpler grammar form this would simply be:

identifier:
    | identifier_or_keyword <but not keyword>
    ;
identifier_or_keyword:
    | identifier_start identifier_part*
    ;
identifier_start:
    | _
    | character from Lu, Ll, Lt, Lm, Lo, or Nl unicode class**
    ;
identifier_part:
    | identifier_start
    | character from Nd, Pc, Mn, Mc, or Cf unicode class**
    ;
character:
    | source_text_character
    | escape
    ;
escape:
    | yadda yadda
    ;

** Unicode determination is out of the scope of the specification. Compliant impls are free to defer to the underlying platform to make this determination.

We'd just explain what the compiler does for escapes.
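For reference, that category-based rule is easy to sketch outside of Roslyn as well. Here is a rough Python equivalent (the function name is mine, and `unicodedata`'s two-letter categories stand in for the platform's UnicodeCategory values; keywords and escapes are ignored):

```python
import unicodedata

# Categories allowed for the first character (plus "_"), per the grammar above.
START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Subsequent characters may also be from these additional categories.
PART_CATS = START_CATS | {"Nd", "Pc", "Mn", "Mc", "Cf"}

def is_identifier(text):
    """Rough sketch of the category-based identifier rule above."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    if first != "_" and unicodedata.category(first) not in START_CATS:
        return False
    return all(unicodedata.category(ch) in PART_CATS for ch in rest)

print(is_identifier("_foo42"))   # True
print(is_identifier("9lives"))   # False (a digit cannot start an identifier)
```

Note that the categories are evaluated against whatever Unicode version the host's character database ships with, which is exactly the "defer to the underlying platform" behavior described above.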
Why is that not acceptable? I think it's reasonable to simply say that an implementation isn't forced into anything here.
What value does this bring to customers? Do we have customers asking for this?
This seems like a large amount of unnecessary complexity. For all intents and purposes, the impl that Roslyn has is the specification for what lexical identifiers are accepted. We should just doc it and use that. If there are other impls out there, they'll want to follow this anyway, so all this additional 'profile' stuff just seems like unneeded complexity.
If I'm reading that right, that's 2000 lines that would need to be incorporated into the impl. That's a lot of complexity, especially in the testing realm. I'm not seeing the value here. Why does this need to be done?
Hello @CyrusNajmabadi. I have answered the related questions that you posted to Roslyn issue # 37566 over there.
If I understand correctly, all of your non-conformance issues raised so far are in terms of UAX 31, and could be addressed by stating the intent that the compiler conforms to the standard except for UAX 31. That probably wouldn't be true anyway: different companies deliberately, and for good reasons, choose to implement different algorithms than those presented in the various annexes. Bidirectional handling is a prominent example in the case of Microsoft.
This assumption seems to be the source of all the issues. As of my current understanding, no one ever stated that is the case. It might very well be, but others suggested there isn't a compelling reason for that. As @gafter suggested, do you have an identifier that should be recognized as an identifier but is not, or vice versa? Giving a few concrete examples might help people understand the differences we are talking about. As far as I am concerned, I personally don't mind if the compiler team chooses to follow UAX 31 or a different set of rules, for any reasons they find appropriate (I care much more about the ability to use characters outside the BMP). I am open to reconsideration if there are convincing arguments, but I haven't heard any practical benefits yet, only the ability to claim full conformance in the specification.
Does not compute :-) If the impl matches the spec, then it's correct. So if we say that the spec is whatever the impl currently is, then both are 'correct' :-)
There is certainly value. Where the disagreement comes in is in how much value there actually is, and whether it is worth the enormous costs here both for the spec and the impl. This stuff isn't measured in a vacuum. All of these costs must be considered, and they need to be weighed against the value gained. If the bang/buck is not there, it is not valuable for the team (which is 100% time- and user-constrained) to invest here. It would effectively take away from so many other pieces of work that really would be useful and valuable for the ecosystem as a whole. I appreciate your passion on this topic. But one thing to realize is that not every passion translates into something that is actually important for the language/impl.
I believe the point @srutzky is trying to make is that the spec, by claiming Unicode support, says it conforms to UAX 31, and the impl does not match that claim.
The language spec is specific about what an identifier is, and it refers to some definitions in the Unicode specification. It does not refer to UAX 31 for the definition of identifier. |
Comments on issues listed above from the current C# 6 draft:
This has been addressed. The other two issues were addressed using the remedy proposed by @CyrusNajmabadi: the character classes allowed for identifiers (including the additional restrictions for the first letter) are defined in the ANTLR grammar for Identifiers. I believe this can be closed, but I'll transfer to the dotnet/csharpstandard repo for the committee as a whole to determine that.
See also dotnet/roslyn#9731 and dotnet/roslyn#13560 for shortcomings of existing implementations that may need to be considered when handling this issue.
Background
When the C# language specification was written (that which eventually became ECMA-334, 1st Edition), it incorporated the rules for "Identifiers" as defined in "The Unicode Standard, Version 3.0" (published January, 2000). That definition is (for the most part) as follows (taken from the C# language specification; I added the comments translating to the General Categories):
Since then, the compiler has been updated to use definitions from newer versions of Unicode. This has resulted not only in more characters being available for identifiers, but in some cases also in breaking changes, as noted below (taken from here):
Please see the Notes section regarding code point U+30FB.
Issues
The "Lexical structure" document still states the original version of Unicode that was used back in 2000, even though the compiler has been updated to use a newer version of Unicode. The following is found immediately below the ANTLR definition of Identifiers:
The "Identifier" definition was never the official definition, which in Unicode 3.0 (the version that the spec states is being used, or at least was used at that time) was:
The net result is the same (well, except that the C# spec added the customization of the "Low Line" character -- i.e. underscore -- as a valid starting character). However, including identifier_part is an unnecessary introduction of non-standard terminology (as far as I can tell, it was only ever found in the "PropList.txt" file). Unless there is a technical reason to deviate, the C# specification should use the official definition(s) as stated in Unicode Technical Report #31: IDENTIFIER AND PATTERN SYNTAX (though it's not the syntax shown directly above; more on this in a moment).

While characters do sometimes get re-classified into different General Categories, and definitions of derived properties sometimes change, neither of those events should ever lead to a breaking change with regards to identifiers. Unicode has a stated policy of preventing such things. The following quote is from the "Unicode Character Encoding Stability Policies" document:
They even included a section on stability starting in revision 5 of TR #31, back in 2005. This section shows that once a character is permitted in identifiers, it will always be permitted in future versions of the Unicode Standard.
So, what happened? On the surface it might appear to be a simple case of the definition(s) changing. For example:

- <identifier_start> was updated to include Other_ID_Start = true.
- <identifier_extend> was changed to <identifier_continue>, and the definition of <identifier_continue> was updated to include Other_ID_Continue = true.
- XID_Start and XID_Continue were introduced, and were designated as "preferred" over ID_Start and ID_Continue.
- The definitions were updated to exclude Pattern_Syntax and Pattern_White_Space characters (i.e. -- [:Pattern_Syntax:] -- [:Pattern_White_Space:]).

But the real issue is that, due to such changes across versions of the Unicode Standard, attempting to derive the correct list of characters based on the rules and Categories is quite error prone, and even more so now that the rules are no longer based on just General Categories. Well, Unicode 3.0 does not seem to define how to determine which characters have a particular derived property, so it makes sense that the given formula would be used. However, starting no later than version 5.1 (in 2008), the Unicode Standard, in Unicode Standard Annex #44: UNICODE CHARACTER DATABASE, states explicitly that implementations should a) use the provided list of characters, and b) not derive the list based on the provided algorithm. The link for UAX #44 in the previous sentence goes to revision 2 (for Unicode 5.1 in 2008), but the following quote is taken from revision 8 (for Unicode 6.1 in 2012) since it was re-worded to be clearer and was the first to include an example (a very pertinent one, in fact), and has not changed since, at least not through revision 24 (emphasis added):
Remedies
The following needs to happen in order to conform to the Unicode Standard:
State the actual version of the Unicode Standard that the rules and definitions are being taken from. Stating something like "a newer version" is not acceptable.
State which requirements the compiler will follow:
There is a list of requirements. The C# specification probably won't meet all of them, but it doesn't need to.
Change the definition of "Identifiers" to use XID_Start and XID_Continue. There aren't many differences between the "ID_" and "XID_" versions: for Unicode 12.1, XID_Start has 23 fewer characters than ID_Start, and XID_Continue has only 19 fewer characters than ID_Continue.

Define a "profile" to properly describe extensions (i.e. customizations) made to the "Default Identifier Syntax". Such customizations would include:

- Allowing the "Low Line" character (underscore) in identifier_start.
- Adding back any characters from ID_Start that were removed in XID_Start that might cause this to be a breaking change.
- Adding back any characters from ID_Continue that were removed in XID_Continue that might cause this to be a breaking change.

The new definition of Identifiers (at bare minimum; not including any characters being added back from non-X definitions for backwards compatibility) should be something like:

Incorporate the XID_Start and XID_Continue lists from the "DerivedCoreProperties.txt" file, available on Unicode.org via public FTP. Here is the link to the file for the "latest" version of the Unicode Standard, but it is possible that an earlier version is used. If that is the case, then please be sure to use the intended version of the file (which is stated at the top of the file):

ftp://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
Please do not attempt to use the rules / formula to derive the list of characters.
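Incorporating the published lists is mostly mechanical. Below is a minimal sketch of reading DerivedCoreProperties.txt-style lines into code point ranges (the sample lines follow the file's real format, but the function and variable names are mine and purely illustrative):

```python
def parse_derived_props(lines, wanted):
    """Collect (low, high) code point ranges for the requested derived
    properties from DerivedCoreProperties.txt-style lines."""
    props = {name: [] for name in wanted}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        field, prop = (part.strip() for part in line.split(";", 1))
        if prop not in props:
            continue
        if ".." in field:                      # e.g. "0041..005A"
            lo, hi = field.split("..")
        else:                                  # single code point
            lo = hi = field
        props[prop].append((int(lo, 16), int(hi, 16)))
    return props

# Sample lines in the real file's format:
sample = [
    "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "005F          ; XID_Continue # Pc       LOW LINE",
    "2118          ; XID_Start # Sm       SCRIPT CAPITAL P",
]
ranges = parse_derived_props(sample, {"XID_Start", "XID_Continue"})
print(ranges["XID_Start"])     # [(65, 90), (8472, 8472)]
```

The parsed ranges can then be baked into a static table in the implementation, so that identifier validity never varies with the platform's Unicode data.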
Notes
Regarding code point U+30FB (KATAKANA MIDDLE DOT):

- It was discussed by a working group on 2000-03-02.
- It was originally categorized as "Pc", at least in versions 3.0 through 4.0.1.
- It qualified as being in the "identifier_extend" list due to being "Pc", at least in versions 3.0 and 3.0.1.
- It was included in the ID_Continue list (in the DerivedCoreProperties.txt file, introduced in version 3.1), at least in versions 3.1 through 4.0.1.
- Its general category changed from "Pc" to "Po" in Unicode version 4.1 (not version 6.0).
- It was no longer included in the ID_Continue list starting in version 4.1.
- For some reason (which I have not yet discovered) it was not included in the Other_ID_Continue "Contributory" property (introduced in version 4.1; in the PropList.txt file), which is designed to support backwards compatibility and contains "grandfathered" characters that had once been in the ID_Continue list but are no longer in the current version.
- Recategorization alone cannot be the reason that it's no longer valid, because a) it should then be in the Other_ID_Continue list, and b) being "Po" isn't a guarantee of exclusion, given that "U+00B7 MIDDLE DOT" is in XID_Continue (starting in version 3.1), and both it and "U+0387 GREEK ANO TELEIA" are in both ID_Continue and XID_Continue starting in Unicode version 5.1, via Other_ID_Continue.
- It can easily be added via customization, and is even in the "Candidate Characters for Inclusion in Identifiers" list (Table 3) in section "2.4 Specific Character Adjustments" of UAX #31, starting in revision 10 (for Unicode version 5.2, 2009).
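The practical upshot for U+30FB is visible in any XID-based identifier check; a small illustration (Python, whose identifier rule is XID_Start followed by XID_Continue*):

```python
import unicodedata

dot = "\u30fb"  # KATAKANA MIDDLE DOT

# Its General Category today is Po (it was Pc before Unicode 4.1):
print(unicodedata.category(dot))      # Po

# Valid under a Unicode 3.0 category-derived rule (it was "Pc" then),
# but absent from today's XID_Continue list, so it is rejected:
print(("abc" + dot).isidentifier())   # False
```

This is precisely the kind of breaking change a pinned, static list would have prevented.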
Summary
The end result must be a static list of valid identifier characters that does not "depend on the underlying platform for their Unicode behavior". It is perfectly acceptable for methods such as CharUnicodeInfo.GetUnicodeCategory() to reflect newer versions of Unicode upon updates to .NET. However, it is not acceptable for any such changes to be reflected in which characters are valid in identifiers. Any changes in the future would need to be an updated static list combined with an updated specification that indicates the updated Unicode version being used.

For example, SQL Server uses the ID_Start and ID_Continue definitions from Unicode 3.2 (plus a few customizations, and minus any support for supplementary characters). This is stated in the documentation and has been consistent across at least 7 versions of SQL Server and updates to the underlying OS. While Unicode 3.2 is quite old and it would certainly be nice to have the definitions updated, it is at least an otherwise proper implementation.

Related Issues