Unicode tables used for checking identifiers seem to be very out of date #44284

Serentty · 2020-05-15T07:12:31Z

According to the specification here, many characters which the compiler currently does not allow in identifiers should be allowed, as they have the necessary Unicode properties to qualify. At first I suspected that this might be an issue with characters falling outside of the BMP (which could theoretically also be another issue), but it seems that even many characters within the BMP such as 鿭 (the simplified Chinese character for the element nihonium) which should qualify do not work, since they were added more recently than whatever tables the compiler is using. The C# standard does direct the reader to the Unicode standard version 3.0 section 4.5 for information on character classes, but it does not specify that identifiers are to be limited to those characters which already existed in version 3.0, or that C# is fixed to Unicode 3.0 for character class data.

svick · 2020-05-15T13:57:03Z

At first I suspected that this might be an issue with characters falling outside of the BMP

Yes, that is an issue, and it's already tracked by #13474 and #9731.

it seems that even many characters within the BMP such as 鿭 (the simplified Chinese character for the element nihonium) which should qualify do not work, since they were added more recently than whatever tables the compiler is using

The C# compiler uses character tables from the framework which it is running on. And since character tables in .Net Framework seem to be fairly outdated, while recent versions of .Net Core use newer tables, whether U+9FED is accepted in an identifier depends on the way you run the compiler.

For example, consider the following program:

class \u9FED
{
    static void Main() {}
}

When I compile it using the .Net Core SDK 3.1 (which uses .Net Core 3.1), it works fine:

> dotnet build
Microsoft (R) Build Engine version 16.5.0+d4cbfca49 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 71,03 ms for C:\code\tmp\hwapp\hwapp.csproj.
  hwapp -> C:\code\tmp\hwapp\bin\Debug\netcoreapp1.1\hwapp.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

But if I use .Net Core SDK version 2.2 instead (by editing global.json), it doesn't work:

> dotnet build      
Microsoft (R) Build Engine version 16.2.32702+c4012a063 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 969,12 ms for C:\code\tmp\hwapp\hwapp.csproj.
Program.cs(1,7): error CS1001: Identifier expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1514: { expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1513: } expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1056: Unexpected character '\u9FED' [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(2,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(4,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]

Build FAILED.

Program.cs(1,7): error CS1001: Identifier expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1514: { expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1513: } expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1056: Unexpected character '\u9FED' [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(2,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(4,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
    0 Warning(s)
    6 Error(s)

Time Elapsed 00:00:03.58

It also doesn't work if I use the .Net Framework version of MSBuild:

>msbuild /v:m
Microsoft (R) Build Engine version 16.5.0+d4cbfca49 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

Program.cs(1,7): error CS1001: Identifier expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1514: { expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1513: } expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1056: Unexpected character '\u9FED' [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(2,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(4,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]

How are you running the compiler? Is possible for you to switch to running it on a recent version of .Net Core?

Serentty · 2020-05-15T14:50:04Z

That's strange. I was using the 5.0 preview. I was using the character directly in the source code instead of the escape syntax that you used in that example, though. Would that make a difference?

ufcpp · 2020-05-15T15:14:44Z

Roslyn uses latest Unicode table (Unicode 12.0 in .NET Core 3.1) but only BMP characters.

I'm working on the fix:
ufcpp@59df36d

changing GetUnicodeCategory(char) to GetUnicodeCategory(int) to support non-BMP characters.

svick · 2020-05-15T15:16:00Z

I was using the 5.0 preview.

How exactly are you compiling your code? Using dotnet build? Or from Visual Studio?

When I use VS 16.6 Preview 5, it doesn't work. I believe this is because VS is a .Net Framework app, and so it runs the compiler on .Net Framework, which means it uses outdated character tables.

I was using the character directly in the source code instead of the escape syntax that you used in that example, though. Would that make a difference?

It shouldn't.

Serentty · 2020-05-15T15:34:59Z

Oh, this line actually builds now! I had to retype it for some reason. Maybe I had some lingering surrogate halves somewhere in the code (outside of the string literal of course).

var 鿭 = "𗼇𗟲";

Yes, I was compiling using dotnet build. It also seems like the C# VS Code plugin rejects that identifier even though the compiler will accept it, so that probably also misled me a bit.

Roslyn uses latest Unicode table (Unicode 12.0 in .NET Core 3.1) but only BMP characters.

Oh nice, so it will probably be 13.0 in 5.0 then.

I'm working on the fix:

That's really good to hear! I'm really happy to see the Unicode situation improving. I'm guessing you're using int instead of Rune so that it can detect unpaired surrogates, right?

gafter · 2020-05-18T03:06:27Z

Is this issue actionable at this point?

Serentty · 2020-05-18T19:04:03Z

It seems to me now like the issue, if there is an issue at all, is with the language server and/or the VS Code plugin, since only it complains about that identifier, not the actual compiler.

Serentty · 2020-05-18T19:06:45Z

Yep, I can confirm that I still get a red squiggly and an error about an unexpected character in VS Code. I'm not sure if the language server is in this repository or another one.

svick · 2020-05-18T22:04:12Z

@Serentty The VS Code plugin for C# uses OmniSharp, whose server part runs on .Net Framework, which would explain why it uses old character tables. There's an issue with more details about why OmniSharp uses .Net Framework at OmniSharp/omnisharp-roslyn#1703.

gafter · 2020-05-18T23:58:03Z

Closing, as I expect The VS Code plugin for C# is where that will have to be fixed, and that is being tracked at OmniSharp/omnisharp-roslyn#1703

Serentty · 2020-05-19T00:44:19Z

That makes sense! Thanks for figuring out why this happens.

gafter added the Area-Compilers label May 18, 2020

gafter closed this as completed May 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode tables used for checking identifiers seem to be very out of date #44284

Unicode tables used for checking identifiers seem to be very out of date #44284

Serentty commented May 15, 2020

svick commented May 15, 2020

Serentty commented May 15, 2020

ufcpp commented May 15, 2020

svick commented May 15, 2020

Serentty commented May 15, 2020

gafter commented May 18, 2020

Serentty commented May 18, 2020

Serentty commented May 18, 2020

svick commented May 18, 2020

gafter commented May 18, 2020

Serentty commented May 19, 2020

Unicode tables used for checking identifiers seem to be very out of date #44284

Unicode tables used for checking identifiers seem to be very out of date #44284

Comments

Serentty commented May 15, 2020

svick commented May 15, 2020

Serentty commented May 15, 2020

ufcpp commented May 15, 2020

svick commented May 15, 2020

Serentty commented May 15, 2020

gafter commented May 18, 2020

Serentty commented May 18, 2020

Serentty commented May 18, 2020

svick commented May 18, 2020

gafter commented May 18, 2020

Serentty commented May 19, 2020