Compiler does not correctly interpret surrogate pairs when used in an identifier #9731

Open
tannergooding opened this issue Mar 14, 2016 · 13 comments
Labels: Area-Compilers, Bug, help wanted, Language-C#, Tenet-Localization

@tannergooding
Member

The C# specification states that an identifier can start with or contain anything matching letter-character, which is defined as:

letter-character::
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl

However, the compiler does not appear to correctly interpret some characters which match the above categories if they are part of a surrogate pair.

For example, the Sumerian character 𒅴 is categorized as 'OtherLetter' (matching 'Lo' above) when processed through char.GetUnicodeCategory("𒅴", 0).

However, the compiler is interpreting this character as two separate characters (and reporting CS1056 for both). It is likely checking each character individually, rather than checking if the first character is part of a surrogate pair and interpreting the character appropriately if it is.
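
To make the suspected difference concrete, here is a minimal sketch of a surrogate-aware check. It is not the Roslyn lexer; the class and method names (LetterCharacterSketch, IsLetterCharacter) are hypothetical, and the result for any given character depends on the Unicode data of the underlying platform.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not the compiler's implementation.
public static class LetterCharacterSketch
{
    // Returns true when the character starting at text[index] belongs to one of
    // the classes listed for letter-character (Lu, Ll, Lt, Lm, Lo, Nl).
    // `length` receives the number of UTF-16 code units consumed (1 or 2).
    public static bool IsLetterCharacter(string text, int index, out int length)
    {
        // A high surrogate followed by a low surrogate encodes a single code point.
        length = char.IsSurrogatePair(text, index) ? 2 : 1;

        // The (string, int) overload classifies the whole pair, whereas
        // classifying text[index] alone would yield UnicodeCategory.Surrogate.
        switch (CharUnicodeInfo.GetUnicodeCategory(text, index))
        {
            case UnicodeCategory.UppercaseLetter: // Lu
            case UnicodeCategory.LowercaseLetter: // Ll
            case UnicodeCategory.TitlecaseLetter: // Lt
            case UnicodeCategory.ModifierLetter:  // Lm
            case UnicodeCategory.OtherLetter:     // Lo
            case UnicodeCategory.LetterNumber:    // Nl
                return true;
            default:
                return false;
        }
    }
}
```

For a string holding U+12174, s.Length is 2 and char.GetUnicodeCategory(s[0]) is Surrogate, which matches the two CS1056 errors described above. Calling IsLetterCharacter(s, 0, out length) sets length to 2 and classifies the pair as a whole, so a scanner could advance by length instead of by one char.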

@tannergooding added the Bug, Language-C#, Area-Compilers, and Tenet-Localization labels Mar 14, 2016
@tannergooding
Member Author

FYI. @jaredpar, @dotnet/roslyn-compiler

Also, FYI @brettfo who helped discover the issue

@AlekseyTs
Contributor

Is this a regression?

@tannergooding
Member Author

@AlekseyTs, not a regression. It reproduces back to the v2.0 native compiler. I don't expect this is a priority (especially since it hasn't been reported or discovered outside a random lunch discussion).

@jaredpar added this to the Unknown milestone Mar 14, 2016
@jaredpar added the help wanted label Mar 14, 2016
@miloush

miloush commented Mar 15, 2016

I did run into this issue in the past (with different characters). The reason I did not push it through was that the C# specification (version 5.0 anyway) actually says:

For information on the Unicode character classes mentioned above, see The Unicode Standard, Version 3.0, section 4.5.

That is the only Unicode standard mentioned in references. Your character was not added to Unicode until 5.0. In fact, there were no characters above 0xFFFF in version 3.0.

On the other hand,

Since C# uses a 16-bit encoding of Unicode code points in characters and string values, a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal.

does not say anything about identifiers. And while it might be implied that 𒅴 is perhaps not allowed explicitly, the way I read the specification suggests that int \U00012174; should be allowed, which it isn't.

By the way, some characters have been reclassified to different categories over time, and it is not clear to me whether the specification requires that the classification as of version 3.0 be used.

@ufcpp
Contributor

ufcpp commented Mar 16, 2016

@miloush
https://github.com/dotnet/roslyn/blob/master/docs/compilers/CSharp/Unicode%20Version.md

The Roslyn compilers depend on the underlying platform for their Unicode behavior. As a practical matter, that means that the new compiler will reflect changes in the Unicode standard.

@miloush

miloush commented Mar 30, 2016

@ufcpp it would be nonsense to do anything else, I am just not sure whether that's what the language specification suggests; might be worth updating it.

@gafter
Member

gafter commented Sep 2, 2016

This is also reported as #13474

@gafter
Member

gafter commented Jul 3, 2018

See also #13474 and #13560

@gafter moved this from Projects to In Review in Compiler: Gafter Jul 4, 2018
@gafter
Member

gafter commented Jul 4, 2018

The language specification for C# (both the online version and the ECMA version) permits characters larger than 16 bits in identifiers, either using a Unicode escape sequence or a surrogate pair of code points. It is a bug that the compiler does not accept them.

@gafter moved this from In Review to Next Up in Compiler: Gafter Jul 6, 2018
@gafter
Member

gafter commented Jul 6, 2018

In order to fix this for VB too, we'd need an API to lowercase a UTF32 code point. See https://github.com/dotnet/corefx/issues/30879
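
For illustration only, one way to approximate lowercasing a single code point with existing string APIs is to round-trip through a string. This is a sketch, not the API requested in the linked issue and not necessarily the code discussed later in this thread; CodePointCasingSketch is a hypothetical name, the approach allocates on every call, and whether supplementary characters are actually case-mapped depends on the platform's casing data.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class CodePointCasingSketch
{
    // Lowercases one Unicode code point using invariant-culture rules.
    // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for values
    // outside U+0000..U+10FFFF and for surrogate code points.
    public static int ToLowerInvariant(int codePoint)
    {
        // Encode the code point as one or two UTF-16 code units.
        string text = char.ConvertFromUtf32(codePoint);

        // Let the platform apply its case mapping to the encoded string.
        string lowered = text.ToLowerInvariant();

        // Decode the first code point of the result.
        return char.ConvertToUtf32(lowered, 0);
    }
}
```

With this sketch, ToLowerInvariant('A') returns the code point for 'a', and a caseless character such as U+12174 comes back unchanged.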

@gafter
Member

gafter commented Dec 11, 2018

@jaredpar
Member

That API is .NET Core only, so we can't use it for a while.

@gafter
Member

gafter commented Dec 11, 2018

@jaredpar I expect we'd use the code that Stephen suggested, in the shared library in C#, until someday those APIs are available.
