Compiler does not correctly interpret surrogate pairs when used in an identifier #9731
Comments
Is this a regression? |
@AlekseyTs, not a regression. It repro's back to the v2.0 native compiler. I don't expect this is a priority (especially since it hasn't been reported or discovered outside a random lunch discussion). |
I did run into this issue in the past (with different characters). The reason I did not push it through was because the C# specification (version 5.0 anyway) actually says
That is the only Unicode standard mentioned in references. Your character was not added to Unicode until 5.0. In fact, there were no characters above 0xFFFF in version 3.0. On the other hand,
does not say anything about identifiers. And while it might be implied that 𒅴 is perhaps not allowed explicitly, the way I read the specification suggests that
By the way, some characters have been reclassified to different categories over time; it is not clear to me whether the specification requires that the state at version 3.0 be used. |
@miloush
|
@ufcpp it would be nonsense to do anything else, I am just not sure whether that's what the language specification suggests; might be worth updating it. |
This is also reported as #13474 |
The language specification for C# (both the online version and the ECMA version) permits characters larger than 16 bits in identifiers, either using a Unicode escape sequence or a surrogate pair of code points. It is a bug that the compiler does not accept them. |
In order to fix this for VB too, we'd need an API to lowercase a UTF32 code point. See https://github.com/dotnet/corefx/issues/30879 |
The needed API has been added: see https://github.com/dotnet/corefx/issues/30879#issuecomment-445990453 and https://apisof.net/catalog/System.Text.Rune.ToLowerInvariant(Rune) |
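For illustration, here is a minimal sketch of lowercasing text one code point at a time with the API linked above. It assumes a runtime where System.Text.Rune and string.EnumerateRunes are available (.NET Core 3.0 or later); the class and method names are hypothetical, and this is not the code used in the compiler.

```csharp
using System;
using System.Text;

static class RuneLowercaseDemo
{
    // Hypothetical helper: lowercases a string per code point rather than
    // per UTF-16 code unit, so surrogate pairs are handled as one unit.
    public static string ToLowerByCodePoint(string text)
    {
        var builder = new StringBuilder(text.Length);

        // EnumerateRunes yields one Rune per code point; an unpaired
        // surrogate is reported as U+FFFD instead of throwing.
        foreach (Rune rune in text.EnumerateRunes())
        {
            builder.Append(Rune.ToLowerInvariant(rune).ToString());
        }

        return builder.ToString();
    }
}
```

Calling RuneLowercaseDemo.ToLowerByCodePoint("ABC") returns "abc". For characters above 0xFFFF the result depends on the casing data the runtime ships with, so no specific output is claimed here.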
.NET Core only, so we can't use that API for a while. |
@jaredpar I expect we'd use the code that Stephen suggested, in the shared library in C#, until those APIs are available someday. |
The C# specification states that an identifier can start with or contain anything matching letter-character, which is defined as:

However, the compiler does not appear to correctly interpret some characters which match the above categories if they are part of a surrogate pair.
For example, the Sumerian character 𒅴 is categorized as 'OtherLetter' (matching 'Lo' above) when processed through char.GetUnicodeCategory("𒅴", 0).

However, the compiler is interpreting this character as two separate characters (and reporting CS1056 for both). It is likely checking each character individually, rather than checking if the first character is part of a surrogate pair and interpreting the character appropriately if it is.
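The difference between the two approaches can be sketched as follows. This is only an illustration of the suspected behavior, not the compiler's actual lexer code; the class and method names are hypothetical, and the character is written with a \U escape so the snippet does not depend on file encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SurrogateCategoryDemo
{
    // Hypothetical helper: classifies every UTF-16 code unit on its own,
    // which is what the compiler appears to be doing.
    public static UnicodeCategory[] PerCodeUnit(string text)
    {
        var result = new UnicodeCategory[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            result[i] = char.GetUnicodeCategory(text[i]);
        }
        return result;
    }

    // Hypothetical helper: classifies whole code points, consuming both
    // halves of a surrogate pair in one step.
    public static List<UnicodeCategory> PerCodePoint(string text)
    {
        var result = new List<UnicodeCategory>();
        for (int i = 0; i < text.Length; i++)
        {
            // The (string, int) overload looks at the pair starting at i.
            result.Add(CharUnicodeInfo.GetUnicodeCategory(text, i));
            if (char.IsSurrogatePair(text, i))
            {
                i++; // skip the low surrogate that was just consumed
            }
        }
        return result;
    }
}
```

For the string "\U00012174" (the character above), PerCodeUnit returns two Surrogate entries, while PerCodePoint returns a single OtherLetter entry.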