Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctify Identifier definition to conform to Unicode standard in "Lexical structure" #305

Open
srutzky opened this issue Jul 29, 2019 · 10 comments

Comments

@srutzky
Copy link

srutzky commented Jul 29, 2019

Background

When the C# language specification was written (that which eventually became ECMA-334, 1st Edition, it incorporated the rules for "Identifiers" as defined in "The Unicode Standard, Version 3.0" (published January, 2000). That definition (for the most part), is as follows (taken from the C# language specification; I added the comments translating to the General Categories):

identifier_or_keyword
    : identifier_start_character identifier_part_character*
    ;

identifier_start_character
    : letter_character         /* General Categories: Lu, Ll, Lt, Lm, Lo, or Nl */
    | '_'
    ;

identifier_part_character
    : letter_character         /* General Categories: Lu, Ll, Lt, Lm, Lo, or Nl */
    | decimal_digit_character  /* General Category: Nd */
    | connecting_character     /* General Category: Pc */
    | combining_character      /* General Categories: Mn or Mc */
    | formatting_character     /* General Category: Cf */
    ;

Since then, the compiler has been updated to use definitions from newer versions of Unicode. This has resulted in not only more characters being available for identifiers, but in some cases has also resulted in some breaking changes, as noted below (taken from here ):

Unicode Version Change in C# 6

The Roslyn compilers depend on the underlying platform for their Unicode behavior. As a practical matter, that means that the new compiler will reflect changes in the Unicode standard.

For example, the Unicode Katakana Middle Dot "・" (U+30FB) no longer works in identifiers in C# 6. Its Unicode class was Pc (Punctuation, Connector) in Unicode 5.1 or older, but it changed to Po (Punctuation, Other) in Unicode 6.0.

Please see Notes section regarding code point U+30FB

Issues

  1. The "Lexical structure" document still states the original version of Unicode that was used back in 2000, even though the compiler has been updated to use a newer version of Unicode. The following is found immediately below the ANTLR definition of Identifiers:

    For information on the Unicode character classes mentioned above, see The Unicode Standard, Version 3.0, section 4.5.

  2. The "Identifier" definition was never the official definition, which in Unicode 3.0 (the version that the spec states is being used, or at least was used at that time) was:

    <identifier> ::= <identifier_start> (<identifier_start> | <identifier_extend>)*
    
    <identifier_start> ::= <Characters with General Category of: Lu, Ll, Lt, Lm, Lo, or Nl>
    <identifier_extend> ::= <Characters with General Category of: Mn, Mc, Nd, Pc, or Cf>

    The net result is the same (well, except that the C# spec added the customization of the "Low Line" character -- i.e. underscore -- as a valid starting character). However, including identifier_part is an unnecessary introduction of non-standard terminology (as far as I can tell, it was only ever found in the "PropList.txt" file). Unless there is a technical reason to deviate, the C# specification should use the official definition(s) as stated in Unicode Technical Report dotnet/csharplang#31: IDENTIFIER AND PATTERN SYNTAX (though it's not the syntax shown directly above, more on this in a moment).

  3. While characters do sometimes get re-classified into different General Categories, and definitions of derived properties sometimes change, neither of those events should ever lead to a breaking change with regards to identifiers. Unicode has a stated policy of preventing such things. The following quote is from the "Unicode Character Encoding Stability Policies" document

    Identifier Stability

    Applicable Version: Unicode 3.0+

    All strings that are valid default Unicode identifiers will continue to be valid default Unicode identifiers in all subsequent versions of Unicode. Furthermore, default identifiers never contain characters with the Pattern_Syntax or Pattern_White_Space properties.

    If a string qualifies as an identifier under one version of Unicode, it will qualify as an identifier under all future versions. The reverse is not true—an identifier under Version 5.0 may not be an identifier under Version 4.0—it may contain a character that was unassigned under Unicode 4.0, or (very rarely) a Unicode 4.0 character that was not an identifier character in Unicode 4.0, but became one in Unicode 5.0

    They even included a section on stability starting in revision 5 of TR #31, back in 2005. This section shows that once a character is permitted in identifiers, it will always be permitted in future versions of the Unicode Standard.

    So, what happened? On the surface it might appear to be a simple case of the definition(s) changing. For example:

    1. in Unicode 4.0 (in 2003) the definition of <identifier_start> was updated to include Other_ID_Start = true.
    2. In Unicode 4.1 (in 2004) the term <identifier_extend> was changed to <identifier_continue>, and the definition of <identifier_continue> was updated to include Other_ID_Continue = true.
    3. In Unicode 4.1 (in 2005) added derived properties: XID_Start and XID_Continue.
    4. In Unicode 5.0 (in 2006) elevated XID_Start and XID_Continue as "preferred" over ID_Start and ID_Continue.
    5. In Unicode 5.1 (in 2008) changed the definitions slightly to include --[:Pattern_Syntax:]--[:Pattern_White_Space:].

    But the real issue is that, due to such changes across versions of the Unicode Standard, attempting to derive the correct list of characters based on the rules and Categories is quite error prone, and even more so now that the rules are no longer based on just General Categories. Well, Unicode 3.0 does not seem to define how to handle determining which characters have a particular derived property, so it makes sense that the given formula would be used. However, starting no later than in version 5.1 (in 2008), the Unicode Standard, in Unicode Standard Annex dotnet/csharplang#44: UNICODE CHARACTER DATABASE, states explicitly that implementations should a) use the provided list of characters, and b) not derive the list based on the provided algorithm. The link for UAX # 44 in the previous sentence goes to revision 2 (for Unicode 5.1 in 2008), but the following quote is taken from revision 8 (for Unicode 6.1 in 2012) since it was re-worded to be clearer and was the first to include an example (a very pertinent one, in fact), and has not changed since, at least not through revision 24 (emphasis added):

    Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.

    Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time. These definitions are presented in a modified set notation, expressed as set additions and/or subtractions of various other property values. For example:

    # Derived Property: ID_Start
    #  Characters that can start an identifier.
    #  Generated from:
    #      Lu + Ll + Lt + Lm + Lo + Nl
    #    + Other_ID_Start
    #    - Pattern_Syntax
    #    - Pattern_White_Space
    

    When interpreting definitions of derived properties of this sort, keep in mind that set subtraction is not a commutative operation. Thus "Lo + Lm - Pattern_Syntax" defines a different set than "Lo - Pattern_Syntax + Lm". The order of property set operations stated in the definitions affects the composition of the derived set.

    If there are any cases of mismatches between the definition of a derived property as listed in DerivedCoreProperties.txt or similar data files in the UCD, and the definition of a derived property as a set definition rule, the explicit listing in the data file should always be taken as the normative definition of the property. As described in Stability of Releases the property listing in the data files for any given version of the standard will never change for that version.

Remedies

The following needs to happen in order to conform to the Unicode Standard:

  1. State the actual version of the Unicode Standard that the rules and definitions are being taken from. Stating something like "a newer version" is not acceptable.

    UAX31-C1. An implementation claiming conformance to this specification shall identify the version of this specification.

  2. State which requirements the compiler will follow:

    UAX31-C2. An implementation claiming conformance to this specification shall describe which of the following requirements it observes: ...

    There is a list of requirements. The C# specification probably won't meet all of them, but it doesn't need to.

  3. Change the definition of "Identifiers" to use XID_Start and XID_Continue.

    The XID_Start and XID_Continue properties are improved lexical classes that incorporate the changes described in _Section 5.1, NFKC Modifications_ . They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties.

    There aren't many differences between the "ID_" and "XID_" versions. For Unicode 12.1, XID_Start has 23 fewer characters than ID_Start, and XID_Continue has only 19 fewer characters than ID_Continue.

  4. Define a "profile" to properly describe extensions (i.e. customizations) made to the "Default Identifier Syntax". Such customizations would include:

    1. Adding "Low Line" (U+005F; i.e. underscore) to identifier_start
    2. Potentially adding any characters back in from ID_Start that were removed in XID_Start that might cause this to be a breaking change.
    3. Potentially adding any characters back in from ID_Continue that were removed in XID_Continue that might cause this to be a breaking change.
  5. New definition of Identifiers (at bare minimum; not including any characters being added back from non-X definitions for backwards compatibility) should be something like:

    identifier_or_keyword
        : identifier_start_character identifier_continue_character*
        ;
    
    identifier_start_character
        : <any character having the derived property of XID_Start>
        | '_'
        ;
    
    identifier_continue_character
        : <any character having the derived property of XID_Continue>
        ;
  6. Incorporate the XID_Start and XID_Continue lists from the "DerivedCoreProperties.txt" file, available on Unicode.org via public FTP. Here is the link to the file for the "latest" version of the Unicode Standard, but it is possible that an earlier version is used. If that is the case, then please be sure to use the intended version of the file (which is stated at the top of the file):

    ftp://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt

    Please do not attempt to use the rules / formula to derive the list of characters.

Notes

  1. Regarding code point U+30FB (Katakana Middle Dot) as noted in the "Unicode Version Change in C# 6" quote above (final item in the Background section): this particular code point appears to be an anomaly and so probably shouldn't be used as evidence of breaking changes. Here is what I discovered about U+30FB:
    • It was discussed by a working group on 2000-03-02:

      U+30FB KATAKANA MIDDLE DOT is a different animal entirely. It is connecting punctuation, used to bind together two-part (or multiple-part) katakana representations of foreign words or names. In a programming context, it can be seen as functioning something like the use of "_" in C to form a multi_word_identifier, for example. The Japanese explicitly requested that it be allowed in identifiers.

    • It was originally categorized as "Pc", at least in versions 3.0 through 4.0.1

    • It qualified as being in the "identifier_extend" list due to being "Pc", at least in versions 3.0 and 3.0.1

    • It was included in the ID_Continue list (in the DerivedCoreProperties.txt file, introduced in version 3.1), at least in versions 3.1 through 4.0.1

    • Its general category changed from "Pc" to "Po" in Unicode version 4.1 (not version 6.0)

    • It was no longer included in the ID_Continue list starting in version 4.1

    • For some reason (which I have not yet discovered) it was not included in the Other_ID_Continue "Contributory" property (introduced in version 4.1; in the PropList.txt file) which is designed to support backwards compatibility and contains "grandfathered" characters that had once been in the ID_Continue list but are no longer in the current version.

    • Recategorization alone cannot be the reason that it's no longer valid due to a) it should be in the Other_ID_Continue list, and b) being "Po" isn't a guarantee of exclusion due to "U+00B7 MIDDLE DOT" being in XID_Continue (starting in version 3.1, as well as it and "U+0387 GREEK ANO TELEIA" being in both ID_Continue and XID_Continue starting in Unicode version 5.1, via Other_ID_Continue).

    • It can easily be added via customization, and is even in the "Candidate Characters for Inclusion in Identifiers" list (Table 3) in section "2.4 Specific Character Adjustments" of UAX # 31, starting in revision 10 (for Unicode version 5.2, 2009).

Summary

The end result must be a static list of valid identifier characters that does not "depend on the underlying platform for their Unicode behavior". It is perfectly acceptable for methods such as CharUnicodeInfo.GetUnicodeCategory() to reflect newer versions of Unicode upon updates to .NET. However, it is not acceptable for any such changes to be reflected in which characters are valid in identifiers. Any changes in the future would need to be an updated static list combined with an updated specification that indicates the updated Unicode version being used.

For example, SQL Server uses the ID_Start and ID_Continue definitions from Unicode 3.2 (plus a few customizations, and minus any support for supplementary characters). This is stated in the documentation and has been consistent across at least 7 versions of SQL Server and updates to the underlying OS. While Unicode 3.2 is quite old and it would certainly be nice to have the definitions updated, it is at least an otherwise proper implementation.

Related Issues

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Jul 29, 2019

But the real issue is that, due to such changes across versions of the Unicode Standard, attempting to derive the correct list of characters based on the rules and Categories is quite error prone,

Why not simplify this all, and just use what teh actual impl does to define the spec rule? i.e. the impl has: https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/InternalUtilities/UnicodeCharacterUtilities.cs#L12 and https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/InternalUtilities/UnicodeCharacterUtilities.cs#L46

Summarizing them, it looks like we have the following:

A C# identifier can start with any character from [a-zA-Z_] or any character that the underlying platform APIs consider: UnicodeCategory.UppercaseLetter, UnicodeCategory.LowercaseLetter, UnicodeCategory.TitlecaseLetter, UnicodeCategory.ModifierLetter, UnicodeCategory.OtherLetter, UnicodeCategory.LetterNumber. See https://docs.microsoft.com/en-us/dotnet/api/system.globalization.unicodecategory?view=netframework-4.8 for more details.

After the starting character a C# identifier can be followed by any characters from that above set, or any character that the underlying platform APIs consider: UnicodeCategory.DecimalDigitNumber, UnicodeCategory.ConnectorPunctuation, UnicodeCategory.Format, UnicodeCategory.NonSpacingMark, UnicodeCategory.SpacingCombiningMark, UnicodeCategory.LetterNumber.

Would that not be sufficient enogh? In a simpler grammar form this would simply be:

identifier:
  | identifier_or_keyword <but not keyword>
  ;

identifier_or_keyword:
  | identifier_start identifier_part+

identifier_start:
  | _
  | character from Lu, Ll, Lt, Lm, Lo, or Nl unicode class**
  ;

identifier_part:
  | identifier_start
  | character from Nd, Pc, Mn, Mc, or Cf unicode class**

character:
  | source_text_character
  | escape
  ;

** Unicode determination is out of the scope of the specification.
Compliant impls are free to defer to the underlying platform to make this determination.

escape:
   | yadda yadda
   ;

We'd just explain what the compiler does for escape as well.

@CyrusNajmabadi
Copy link
Member

Stating something like "a newer version" is not acceptable.

Why is that not acceptable? I think it's reasonable to simply say that an implementation isn't forced into anything here.

Change the definition of "Identifiers" to use XID_Start and XID_Continue.

What value does this bring to customers? Do we have customers asking for this?

Define a "profile" to properly describe extensions (i.e. customizations) made to the "Default Identifier Syntax". Such customizations would include:

This seems like a large amount of unnecessary complexity. For all intents and purposes the impl that Roslyn has is the specificaiton for what lexical identifiers are accepted. We should just doc it and use that. If there are other impls out there, they'll want to follow this anyways, so all this additional 'profile' stuff just seems like unneeded complexity.

Incorporate the XID_Start and XID_Continue lists from the "DerivedCoreProperties.txt" file, available on Unicode.org via public FTP. Here is the link to the file for the "latest" version of the Unicode Standard, but it is possible that an earlier version is used. If that is the case, then please be sure to use the intended version of the file (which is stated at the top of the file):

If i'm reading that right, that's 2000 lines that would need to be incorporated into the impl. That's a lot of complexity esp in the testing realm. I'm not seeing the value here. Why does this need to be done?

@srutzky
Copy link
Author

srutzky commented Aug 26, 2019

Hello @CyrusNajmabadi . The related questions that you posted to Roslyn issue # 37566 I have answered there.

  • Why not simplify this all, and just use what the actual impl does to define the spec rule?

    Because what the actual implementation does is incorrect according to the rules set forth in the Unicode Standard, UAX31 (especially starting with version 4.0, 2003). I could potentially be wrong in my reading of it, but I don't see any possible interpretation that would allow for relying upon the underlying platform and/or the possibility of allowing breaking changes, at least not when taking into account the entirety of that specification (i.e. UAX31).

  • Stating something like "a newer version" is not acceptable.

    Why is that not acceptable? I think it's reasonable to simply say that an implementation isn't forced into anything here.

    Because the very first conformance rule is:

    UAX31-C1. An implementation claiming conformance to this specification shall identify the version of this specification.

    This project (well, all projects that claim to conform to the Unicode Standard) needs to set a level of expectation with users of it such that it's saying, "this is how C# works", instead of, "this might be how C# works".

  • Change the definition of "Identifiers" to use XID_Start and XID_Continue.

    What value does this bring to customers? Do we have customers asking for this?

    I don't know if anyone is asking for it, or if anyone even knows to ask for it. But the value is increased security (info should be in "UTR36: Unicode Security Considerations" and "UTS39: Unicode Security Mechanisms"). Though it's possible that this also requires that identifiers first be normalized. I need to research that part a little more. Either way, this has been the main recommendation from Unicode starting in version 5.0 (in 2006).

  • Define a "profile" to properly describe extensions (i.e. customizations) made to the "Default Identifier Syntax". Such customizations would include:

    This seems like a large amount of unnecessary complexity. For all intents and purposes the impl that Roslyn has is the specification for what lexical identifiers are accepted. We should just doc it and use that. If there are other impls out there, they'll want to follow this anyways, so all this additional 'profile' stuff just seems like unneeded complexity.

    Yes, you could just document the current implementation. However, that leaves you with a system that cannot claim to support the Unicode Standard. So, any added complexity is actually necessary (which is understandably frustrating).

  • Incorporate the XID_Start and XID_Continue lists from the "DerivedCoreProperties.txt" file, available on Unicode.org via public FTP. Here is the link to the file for the "latest" version of the Unicode Standard, but it is possible that an earlier version is used. If that is the case, then please be sure to use the intended version of the file (which is stated at the top of the file):

    If i'm reading that right, that's 2000 lines that would need to be incorporated into the impl. That's a lot of complexity esp in the testing realm. I'm not seeing the value here. Why does this need to be done?

    Well, the XID_Start list in Unicode version 12.1 is 1240 lines, some being individual code points and some being ranges. I'm sure there are different means of implementation, but I don't see how this increases testing complexity. The way I see it, by working within the recommendations of the Unicode Standard, testing complexity is actually reduced since you are removing an external dependency (i.e. the underlying platform) that may or may not support certain characters.

    The value of following these recommendations is being able to accurately claim full support of the Unicode Standard (of a particular version, of course). There is also value in providing users of the implementations / compilers a stable environment, that all characters in source code files that work in this particular version of C# / compiler will continue to work across changes in the underlying platform such that you don't need warnings like the one quoted at the end of the Background section above.

    "Need" is a matter of perspective. I am not making any demands here, only recommendations. As I mentioned in response to your related questions on the Roslyn issue, I just want C# and its implementations to accurately support the Unicode Standard such that the Unicode Consortium cannot claim that any part of this does not and hence any statements claiming that it does need to be removed.

    To be clear, the first issue I posted, Correctify Unicode-related errors and omissions in "Lexical structure" #304 must be done as the current state of the specification is just wrong. That is a technical matter, not opinion, and again represents no actual changes to approach or implementation. It simply makes the description of how C# works accurate (with respect to Unicode). But, this issue (2698) and the resulting Roslyn issue (37566) should be done, for the integrity of the project.

@miloush
Copy link

miloush commented Aug 26, 2019

If I understand correctly all your non-conformance issues raised so far are in terms of UAX31, and could be addressed by stating the intent that the compiler conforms to the standard except for UAX31.

That probably wouldn't be true anyway. Different companies deliberately and for good reasons choose to implement different algorithms than presented in the various annexes. Bidirectional handling is a prominent example in the case of Microsoft.

the C# specification should use the official definition(s) as stated in Unicode Technical Report dotnet/csharplang#31: IDENTIFIER AND PATTERN SYNTAX

This assumption seems to be the source of all the issues. As of my current understanding, no one ever stated that is the case. It might very well be, but others suggested there isn't a compelling reason for that.

As @gafter suggested, do you have an identifier that should be recognized as an identifier but is not, or vice versa? Giving few concrete examples might help people understand the differences we are talking about.

As far as I am concerned, I personally don't mind if the compiler team chooses to follow UAX 31 or a different set of rules, for any reasons they find appropriate (I care much more about the ability to use characters out of BMP). I am open to reconsideration if there are convincing arguments, but I haven't heard any practical benefits yet, only the ability to claim full conformance in the specification.

@CyrusNajmabadi
Copy link
Member

Because what the actual implementation does is incorrect

Does not compute :-)

If the impl matches the spec, then it's correct. So if we say that the spec is whatever the imple currently is, the both are 'correct' :-)

@CyrusNajmabadi
Copy link
Member

There is also value in providing users of the implementations / compilers a stable environment, that all characters in source code files that work in this particular version of C# / compiler will continue to work across changes in the underlying platform

There is certainly value. Where the disagreement comes in is in how much value there actually is and whether it is worth the enormous costs here both for the spec and the impl.

This stuff isn't measured in a vacuum. All of these costs must be considered and they need to be weighed against the value gained.

If the bang/buck is not there, it is not valuable for the team (Wich is 100% time and user constrained) to invest here. It effectively would actually take away from so many other pieces of work that really would be useful and valuable for the ecosystem as a whole to be focused on.

I appreciate your passion in this topic. But one thing to realize is that not every single passion translates it to meaning it is actually important for the language/impl

@miloush
Copy link

miloush commented Aug 26, 2019

If the impl matches the spec, then it's correct.

I believe the point @srutzky is trying to make is that the spec, by claiming Unicode support, says it conforms to UAX 31 and the impl does match that claim.

@gafter
Copy link
Member

gafter commented Aug 26, 2019

The language spec is specific about what an identifier is, and it refers to some definitions in the Unicode specification. It does not refer to UAX 31 for the definition of identifier.

@BillWagner
Copy link
Member

Comments on issues listed above from the current C# 6 draft:

  1. The "Lexical structure" document still states the original version of Unicode that was used back in 2000,

This has been addressed.

The other two issues were addressed using the remedy proposed by @CyrusNajmabadi : The character classes allowed for identifiers (including the additional restrictions for the first letter) are defined in the ANTLR grammar for Identifiers.

I believe this can be closed, but I'll transfer to the dotnet/csharpstandard repo for the committee as a whole to determine that.

@BillWagner BillWagner transferred this issue from dotnet/csharplang Apr 30, 2021
@gafter
Copy link
Member

gafter commented May 11, 2022

See also dotnet/roslyn#9731 and dotnet/roslyn#13560 for shortcomings of existing implementations that may need to be considered when handling this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants