Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode tables used for checking identifiers seem to be very out of date #44284

Closed
Serentty opened this issue May 15, 2020 · 11 comments
Closed

Comments

@Serentty
Copy link

According to the specification here, many characters which the compiler currently does not allow in identifiers should be allowed, as they have the necessary Unicode properties to qualify. At first I suspected that this might be an issue with characters falling outside of the BMP (which could theoretically also be another issue), but it seems that even many characters within the BMP such as 鿭 (the simplified Chinese character for the element nihonium) which should qualify do not work, since they were added more recently than whatever tables the compiler is using. The C# standard does direct the reader to the Unicode standard version 3.0 section 4.5 for information on character classes, but it does not specify that identifiers are to be limited to those characters which already existed in version 3.0, or that C# is fixed to Unicode 3.0 for character class data.

@svick
Copy link
Contributor

svick commented May 15, 2020

At first I suspected that this might be an issue with characters falling outside of the BMP

Yes, that is an issue, and it's already tracked by #13474 and #9731.

it seems that even many characters within the BMP such as 鿭 (the simplified Chinese character for the element nihonium) which should qualify do not work, since they were added more recently than whatever tables the compiler is using

The C# compiler uses character tables from the framework which it is running on. And since character tables in .Net Framework seem to be fairly outdated, while recent versions of .Net Core use newer tables, whether U+9FED is accepted in an identifier depends on the way you run the compiler.

For example, consider the following program:

class \u9FED
{
    static void Main() {}
}

When I compile it using the .Net Core SDK 3.1 (which uses .Net Core 3.1), it works fine:

> dotnet build
Microsoft (R) Build Engine version 16.5.0+d4cbfca49 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 71,03 ms for C:\code\tmp\hwapp\hwapp.csproj.
  hwapp -> C:\code\tmp\hwapp\bin\Debug\netcoreapp1.1\hwapp.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

But if I use .Net Core SDK version 2.2 instead (by editing global.json), it doesn't work:

> dotnet build      
Microsoft (R) Build Engine version 16.2.32702+c4012a063 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 969,12 ms for C:\code\tmp\hwapp\hwapp.csproj.
Program.cs(1,7): error CS1001: Identifier expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1514: { expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1513: } expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1056: Unexpected character '\u9FED' [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(2,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(4,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]

Build FAILED.

Program.cs(1,7): error CS1001: Identifier expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1514: { expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1513: } expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1056: Unexpected character '\u9FED' [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(2,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(4,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
    0 Warning(s)
    6 Error(s)

Time Elapsed 00:00:03.58

It also doesn't work if I use the .Net Framework version of MSBuild:

>msbuild /v:m
Microsoft (R) Build Engine version 16.5.0+d4cbfca49 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

Program.cs(1,7): error CS1001: Identifier expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1514: { expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1513: } expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(1,7): error CS1056: Unexpected character '\u9FED' [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(2,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]
Program.cs(4,1): error CS1022: Type or namespace definition, or end-of-file expected [C:\code\tmp\hwapp\hwapp.csproj]

How are you running the compiler? Is possible for you to switch to running it on a recent version of .Net Core?

@Serentty
Copy link
Author

That's strange. I was using the 5.0 preview. I was using the character directly in the source code instead of the escape syntax that you used in that example, though. Would that make a difference?

@ufcpp
Copy link
Contributor

ufcpp commented May 15, 2020

Roslyn uses latest Unicode table (Unicode 12.0 in .NET Core 3.1) but only BMP characters.

I'm working on the fix:
ufcpp@59df36d

changing GetUnicodeCategory(char) to GetUnicodeCategory(int) to support non-BMP characters.

@svick
Copy link
Contributor

svick commented May 15, 2020

I was using the 5.0 preview.

How exactly are you compiling your code? Using dotnet build? Or from Visual Studio?

When I use VS 16.6 Preview 5, it doesn't work. I believe this is because VS is a .Net Framework app, and so it runs the compiler on .Net Framework, which means it uses outdated character tables.

I was using the character directly in the source code instead of the escape syntax that you used in that example, though. Would that make a difference?

It shouldn't.

@Serentty
Copy link
Author

Oh, this line actually builds now! I had to retype it for some reason. Maybe I had some lingering surrogate halves somewhere in the code (outside of the string literal of course).

var 鿭 = "𗼇𗟲";

Yes, I was compiling using dotnet build. It also seems like the C# VS Code plugin rejects that identifier even though the compiler will accept it, so that probably also misled me a bit.

Roslyn uses latest Unicode table (Unicode 12.0 in .NET Core 3.1) but only BMP characters.

Oh nice, so it will probably be 13.0 in 5.0 then.

I'm working on the fix:

That's really good to hear! I'm really happy to see the Unicode situation improving. I'm guessing you're using int instead of Rune so that it can detect unpaired surrogates, right?

@gafter
Copy link
Member

gafter commented May 18, 2020

Is this issue actionable at this point?

@Serentty
Copy link
Author

It seems to me now like the issue, if there is an issue at all, is with the language server and/or the VS Code plugin, since only it complains about that identifier, not the actual compiler.

@Serentty
Copy link
Author

Yep, I can confirm that I still get a red squiggly and an error about an unexpected character in VS Code. I'm not sure if the language server is in this repository or another one.

@svick
Copy link
Contributor

svick commented May 18, 2020

@Serentty The VS Code plugin for C# uses OmniSharp, whose server part runs on .Net Framework, which would explain why it uses old character tables. There's an issue with more details about why OmniSharp uses .Net Framework at OmniSharp/omnisharp-roslyn#1703.

@gafter
Copy link
Member

gafter commented May 18, 2020

Closing, as I expect The VS Code plugin for C# is where that will have to be fixed, and that is being tracked at OmniSharp/omnisharp-roslyn#1703

@gafter gafter closed this as completed May 18, 2020
@Serentty
Copy link
Author

That makes sense! Thanks for figuring out why this happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants