Compiler does not correctly interpret surrogate pairs when used in an identifier #9731

tannergooding · 2016-03-14T19:52:27Z

The C# specification states that an identifier can start with or contain anything matching letter-character, which is defined as:

letter-character::
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl

However, the compiler does not appear to correctly interpret some characters which match the above categories if they are part of a surrogate pair.

For example, the Sumerian character 𒅴 (U+12174) is categorized as 'OtherLetter' (matching 'Lo' above) when processed through char.GetUnicodeCategory("𒅴", 0).

However, the compiler interprets this character as two separate characters (and reports CS1056 for both). It is likely checking each UTF-16 code unit individually, rather than checking whether the first code unit begins a surrogate pair and, if so, interpreting the pair as a single character.
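A minimal sketch of the surrogate-aware check described above, for illustration only. It is not the compiler's lexer: `IdentifierChars` and `IsLetterCharacterAt` are hypothetical names, and only the `char` and `CharUnicodeInfo` calls are real framework APIs. The result depends on the Unicode data shipped with the runtime.

```csharp
using System;
using System.Globalization;

static class IdentifierChars
{
    // True for the classes listed under letter-character: Lu, Ll, Lt, Lm, Lo, Nl.
    static bool IsLetterCategory(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.UppercaseLetter: // Lu
            case UnicodeCategory.LowercaseLetter: // Ll
            case UnicodeCategory.TitlecaseLetter: // Lt
            case UnicodeCategory.ModifierLetter:  // Lm
            case UnicodeCategory.OtherLetter:     // Lo
            case UnicodeCategory.LetterNumber:    // Nl
                return true;
            default:
                return false;
        }
    }

    // Hypothetical helper: classifies the character that starts at `index` and
    // reports how many UTF-16 code units (1 or 2) it occupies.
    public static bool IsLetterCharacterAt(string text, int index, out int length)
    {
        length = char.IsSurrogatePair(text, index) ? 2 : 1;
        // The (string, int) overload looks at the whole surrogate pair.
        return IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(text, index));
    }

    static void Main()
    {
        string s = "\uD808\uDD74"; // U+12174 written as its surrogate pair

        // Checking one code unit at a time only ever sees a surrogate half.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s[0])); // Surrogate

        // Checking the pair as a unit sees the real category.
        bool isLetter = IsLetterCharacterAt(s, 0, out int length);
        Console.WriteLine($"{isLetter}, {length} code units");
    }
}
```

A lexer built this way would advance by `length` code units after each check, so the low surrogate is never examined on its own.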

tannergooding · 2016-03-14T19:52:53Z

FYI. @jaredpar, @dotnet/roslyn-compiler

Also, FYI @brettfo who helped discover the issue

AlekseyTs · 2016-03-14T19:58:18Z

Is this a regression?

tannergooding · 2016-03-14T20:01:35Z

@AlekseyTs, not a regression. It repros back to the v2.0 native compiler. I don't expect this to be a priority (especially since it hasn't been reported or discovered outside a random lunch discussion).

miloush · 2016-03-15T10:51:29Z

I did run into this issue in the past (with different characters). The reason I did not push it through was because the C# specification (version 5.0 anyway) actually says

For information on the Unicode character classes mentioned above, see The Unicode Standard, Version 3.0, section 4.5.

That is the only Unicode standard mentioned in references. Your character was not added to Unicode until 5.0. In fact, there were no characters above 0xFFFF in version 3.0.

On the other hand,

Since C# uses a 16-bit encoding of Unicode code points in characters and string values, a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal.

does not say anything about identifiers. And while it might be implied that writing 𒅴 directly is perhaps not allowed, the way I read the specification suggests that int \U00012174; should be allowed, which it isn't.
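For reference, a small sketch of how the code point in that escape maps to a surrogate pair under the 16-bit encoding. `SurrogateDemo` and `ToUtf16` are hypothetical names used only for this illustration; `char.ConvertFromUtf32` and `char.ConvertToUtf32` are the framework equivalents.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: encodes a supplementary code point (U+10000..U+10FFFF)
    // as a two-code-unit UTF-16 string using the standard surrogate arithmetic.
    public static string ToUtf16(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new string(new[] { high, low });
    }

    static void Main()
    {
        string manual = ToUtf16(0x12174);
        Console.WriteLine($"{(int)manual[0]:X4} {(int)manual[1]:X4}"); // D808 DD74

        // The framework helpers agree with the manual arithmetic.
        Console.WriteLine(manual == char.ConvertFromUtf32(0x12174));              // True
        Console.WriteLine(char.ConvertToUtf32(manual[0], manual[1]) == 0x12174);  // True
    }
}
```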

By the way, some characters have been reclassified into different categories over time, and it is not clear to me whether the specification requires the classification as of version 3.0 to be used.

ufcpp · 2016-03-16T02:03:34Z

@miloush
https://github.com/dotnet/roslyn/blob/master/docs/compilers/CSharp/Unicode%20Version.md

The Roslyn compilers depend on the underlying platform for their Unicode behavior. As a practical matter, that means that the new compiler will reflect changes in the Unicode standard.

miloush · 2016-03-30T10:13:25Z

@ufcpp it would be nonsense to do anything else, I am just not sure whether that's what the language specification suggests; might be worth updating it.

gafter · 2016-09-02T17:20:07Z

This is also reported as #13474

gafter · 2018-07-03T19:19:23Z

See also #13474 and #13560

gafter · 2018-07-04T00:53:07Z

The language specification for C# (both the online version and the ECMA version) permits characters larger than 16 bits in identifiers, either using a Unicode escape sequence or a surrogate pair of code points. It is a bug that the compiler does not accept them.

gafter · 2018-07-06T17:44:58Z

In order to fix this for VB too, we'd need an API to lowercase a UTF32 code point. See https://github.com/dotnet/corefx/issues/30879

gafter · 2018-12-11T01:34:42Z

The needed API has been added: see https://github.com/dotnet/corefx/issues/30879#issuecomment-445990453 and https://apisof.net/catalog/System.Text.Rune.ToLowerInvariant(Rune)
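A minimal sketch of calling the linked API on a code point outside the 16-bit range, assuming a runtime where `System.Text.Rune` is available (.NET Core 3.0 or later). `RuneLowercaseDemo` and `LowerInvariant` are hypothetical names for illustration, and the mapping returned depends on the runtime's casing data.

```csharp
using System;
using System.Text;

static class RuneLowercaseDemo
{
    // Hypothetical wrapper: lowercases a single UTF-32 code point.
    public static int LowerInvariant(int codePoint)
    {
        return Rune.ToLowerInvariant(new Rune(codePoint)).Value;
    }

    static void Main()
    {
        // U+10400 DESERET CAPITAL LETTER LONG I needs a surrogate pair in UTF-16.
        Rune upper = new Rune(0x10400);
        Console.WriteLine(upper.Utf16SequenceLength); // 2

        // Unicode pairs this letter with U+10428 as its lowercase form.
        Console.WriteLine($"U+{upper.Value:X} -> U+{LowerInvariant(upper.Value):X}");
    }
}
```

`char.ToLowerInvariant` only accepts a single 16-bit `char`, which is why a code-point-level API is needed for case-insensitive identifier comparison in VB.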

jaredpar · 2018-12-11T04:30:26Z

.NET Core only, so we can't use that API for a while.

gafter · 2018-12-11T05:14:03Z

@jaredpar I expect we'd use the code that Stephen suggested, in the shared library in C#, until those APIs are available someday.

tannergooding added Bug Language-C# Area-Compilers Tenet-Localization labels Mar 14, 2016

jaredpar added this to the Unknown milestone Mar 14, 2016

jaredpar added the help wanted label Mar 14, 2016

This was referenced Jul 3, 2018

Roslyn rejects \U unicode characters in identifiers #13560 (Open)

csc doesn't accept tokens that are in the Ll class. #27986 (Closed)

gafter added this to Projects in Compiler: Gafter Jul 3, 2018

gafter moved this from Projects to In Review in Compiler: Gafter Jul 4, 2018

gafter moved this from In Review to Next Up in Compiler: Gafter Jul 6, 2018

gafter mentioned this issue Jul 26, 2018

Champion: "Permit surrogate pairs and wide Unicode-escaped code points in identifiers" dotnet/csharplang#1742 (Open, 5 tasks)

gafter removed this from Next Up in Compiler: Gafter Aug 10, 2018

gafter mentioned this issue Jan 31, 2020

Please add APIs to determine the Unicode category of a UTF32 code point dotnet/runtime#26719 (Closed)

svick mentioned this issue May 15, 2020

Unicode tables used for checking identifiers seem to be very out of date #44284 (Closed)

gafter self-assigned this Nov 1, 2020

srutzky mentioned this issue Apr 30, 2021

Correctify Identifier definition to conform to Unicode standard in "Lexical structure" dotnet/csharpstandard#305 (Open)

gafter mentioned this issue May 11, 2022

Surrogate pairs not recognized in identifiers. #13474 (Closed)

tannergooding mentioned this issue Jul 15, 2023

Compiler doesn't tokenize surrogate pairs correctly #69041 (Closed)
