Proposal: Make Console.Input/OutputEncoding default to UTF-16 on Windows #70168

huoyaoyuan · 2022-06-02T19:08:53Z

Background

Currently, System.Console calls GetConsoleCP on Windows to get console encoding, which has caused enormous problems:

Characters not in the current code page can't be displayed or entered in the console under the default settings:

Without explicitly specifying Encoding.Unicode, the console can't display emoji (via Windows Terminal) or any script the code page doesn't cover. (On a Windows-1252 system, for example, it would not be able to display Chinese.)
Characters are frequently transcoded the wrong way and get garbled.
See "C# Interactive is broken in VS16.8 preview5" (roslyn#48874). Like the person in that thread, I'm pretty annoyed too.
Output also gets garbled with the latest dotnet SDK. This issue newly appeared with an SDK update this month (May).

Proposal

ANSI code pages are entirely legacy. We should get rid of them completely and use some variant of Unicode everywhere.
The internal encoding of Windows NT is UTF-16, the same as .NET's. We can also save the cost of transcoding from UTF-16 to a code page and then back to UTF-16.

This would be a breaking change for those who work with Console.OpenStandardXXX and redirected I/O, which can be addressed by setting the console encoding at the program entry point. We could also add a compatibility switch for this. For ASCII interoperability, we should suggest setting the encoding to UTF-8.
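As a rough illustration of "setting the console encoding at the program entry point", here is a minimal sketch using the existing Console.OutputEncoding property. The class and method names are hypothetical and not part of the proposal.

```csharp
using System;
using System.Text;

public static class Utf8ConsoleExample
{
    // Hypothetical helper: call once at the very start of Main,
    // before anything is written to the console.
    public static void UseUtf8Output()
    {
        // Opts this process into UTF-8 for text written through Console.Out.
        Console.OutputEncoding = Encoding.UTF8;
    }
}
```

Calling `Utf8ConsoleExample.UseUtf8Output()` as the first statement of `Main` is one way an application could pin its encoding regardless of what the default turns out to be.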

Additionally, making UTF-16 the default would also surface encoding problems even when only English is used. Since most code pages, including UTF-8, share the ASCII range, English text is always output correctly even under a misconfigured encoding. Since most development is done in English, encoding problems stay hidden.
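The point about the shared ASCII range can be checked with a small sketch like the one below. It assumes .NET 5 or later (for Encoding.Latin1, used here as a stand-in for a legacy single-byte code page); the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class AsciiOverlapExample
{
    // Returns the encoded bytes of the text as a hex string, e.g. "48-69".
    public static string Hex(Encoding encoding, string text) =>
        BitConverter.ToString(encoding.GetBytes(text));
}
```

`Hex(Encoding.UTF8, "Hi")` and `Hex(Encoding.Latin1, "Hi")` both give `48-69`, so a mismatch between those two goes unnoticed for English text, while `Hex(Encoding.Unicode, "Hi")` gives `48-00-69-00`. For `"\u00E9"` (é) the results already diverge: `C3-A9` in UTF-8 versus `E9` in Latin-1.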

Additional words

I'd really like you to treat encoding problems as severe bugs. They're never a problem for English users, but they have frustrated users of other languages for decades, since the start of multi-language Windows. Fixing such a problem in a minor release of VS instead of a patch release is unacceptable to me, as well as to other Chinese users.
Systems with multi-byte encodings suffer more, even from non-coding elements. Characters in the wrong encoding appear as broken multi-byte sequences (#69781).
There is a Spanish build in the roslyn CI. Can we add a CI leg to verify that the runtime builds (and tests run?) correctly on a non-English system?

ghost · 2022-06-02T19:08:59Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Author:	huoyaoyuan
Assignees:	-
Labels:	`area-System.Text.Encoding`, `untriaged`
Milestone:	-

tarekgh · 2022-06-02T19:43:34Z

On Windows you can set the default code page to UTF-8, and this will apply to all .NET applications. To do that, run intl.cpl, click the Administrative tab, click the Change system locale... button, then check the box labeled Beta: Use Unicode UTF-8 for worldwide language support.

On non-Windows platforms, most terminals already use UTF-8.

huoyaoyuan · 2022-06-03T04:46:36Z

> On Windows you can set the default codepage to UTF-8 and this will reflect on all .NET applications.

I know this option. Unfortunately, there are still tons of encoding issues with it, both existing and newly introduced. This option doesn't solve any issue at all.
Affecting all applications is not an option either. It would affect more applications than just .NET ones, and many applications won't handle it well.

Using UTF-16 has a further benefit: the console is then operated through the W variants of the console API instead of the file API.

runtime/src/libraries/System.Console/src/System/ConsolePal.Windows.cs

Lines 1193 to 1204 in 45589f2

    
           else 
        
           { 
        
               // If the code page could be Unicode, we should use ReadConsole instead, e.g. 
        
               // Note that WriteConsoleW has a max limit on num of chars to write (64K) 
        
               // [https://docs.microsoft.com/en-us/windows/console/writeconsole] 
        
               // However, we do not need to worry about that because the StreamWriter in Console has 
        
               // a much shorter buffer size anyway. 
        
               int charsWritten; 
        
               writeSuccess = Interop.Kernel32.WriteConsole(hFile, p, bytes.Length / BytesPerWChar, out charsWritten, IntPtr.Zero); 
        
               Debug.Assert(!writeSuccess || bytes.Length / BytesPerWChar == charsWritten); 
        
           }

davidfowl · 2022-06-03T05:07:39Z

Is this a mega breaking change?

ufcpp · 2022-06-03T06:30:28Z

I have the same problem with CP932. However, I would want the default to be UTF-8 instead of UTF-16.

https://developercommunity.visualstudio.com/t/%E3%83%87%E3%83%90%E3%83%83%E3%82%B0%E5%AE%9F%E8%A1%8C%E3%81%A7Shift_JIS%E3%81%AB%E3%81%AA%E3%81%84%E6%96%87%E5%AD%97%E3%81%8C%E8%A1%A8%E7%A4%BA%E3%81%95%E3%82%8C%E3%81%BE%E3%81%9B%E3%82%93/10001821?port=1026&fsid=e89be1c0-48c2-403a-a7b8-203df6f715ef&entry=myfeedback&ref=native&refTime=1654237637514&refUserId=503a37f9-893c-464f-a313-193fe5747f8a

ufcpp · 2022-06-03T06:36:27Z

GetConsoleCP is OK. What I want is just for the F5 debug console to run with CP 65001.

huoyaoyuan · 2022-06-03T07:24:24Z

> Is this a mega breaking change?

In fact, I don't know. It also depends on how Windows handles the relationship between the console file and the console APIs.

In other words, I want to make WriteConsoleW the default instead of the current WriteFile.

huoyaoyuan · 2022-06-03T10:56:10Z

I did some tests with redirection:

The > operator of cmd (native redirect) writes the output to the file as-is, in the specified encoding.
The > operator of PowerShell always reads the output in the current system encoding and writes it as UTF-8. It does not react to the application changing its console encoding. Anything not in the system encoding gets garbled.

No magic happens here. Both sides of the pipe need to agree on the encoding. Changing the default to UTF-16 would break a lot, since UTF-16 isn't widely used as a file or communication encoding.
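The need for both ends of a pipe to agree can be simulated without any real redirection. The sketch below is only an illustration, assuming .NET 5 or later for Encoding.Latin1; the names are hypothetical and it is not the test described above.

```csharp
using System;
using System.Text;

public static class PipeMismatchExample
{
    // Simulates a writer and a reader that may disagree about the
    // encoding of the bytes travelling through a pipe or file.
    public static string Transfer(string text, Encoding writerEncoding, Encoding readerEncoding)
    {
        byte[] wire = writerEncoding.GetBytes(text);   // what the producer emits
        return readerEncoding.GetString(wire);         // what the consumer sees
    }
}
```

With matching encodings, `Transfer("caf\u00E9", Encoding.UTF8, Encoding.UTF8)` returns the original `café`. With a UTF-8 writer and a Latin-1 reader the result is `cafÃ©`, and with a UTF-16 writer and a UTF-8 reader even plain `"Hi"` comes back with a NUL character after each letter.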

The current behavior is far from ideal. Having observed PowerShell garbling things, I now understand how encoding issues happen.
We should consider changing the default to UTF-8.

huoyaoyuan · 2022-06-03T15:26:19Z

Today I read on The Old New Thing that the default encoding can be set to UTF-8 through the application manifest. Although we don't own the manifest for any binaries, we could consider setting this property in the default template. Setting it on the default dotnet.exe could be breaking, though.
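For reference, a manifest along these lines might look like the fragment below. It is a sketch based on the activeCodePage setting that Windows documents for Windows 10 version 1903 and later; the assembly name MyApp.Example is a placeholder, and the thread does not establish how this setting interacts with the console code page.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <!-- Placeholder identity; replace with the application's own. -->
  <assemblyIdentity type="win32" name="MyApp.Example" version="1.0.0.0"/>
  <application>
    <windowsSettings>
      <!-- Requests UTF-8 as the process code page. -->
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

In a .NET project, such a file would typically be referenced through the ApplicationManifest MSBuild property, which is presumably what "setting this property in the default template" would amount to.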

tarekgh · 2022-06-03T15:51:48Z

> Is this a mega breaking change?

Yes, it is a big breaking change. Windows has not made this option the default and has kept it marked as Beta for a while now. It is not something we should risk enabling by default.

ghost · 2022-08-02T08:52:20Z

Tagging subscribers to this area: @dotnet/area-system-console
See info in area-owners.md if you want to be subscribed.

Author:	huoyaoyuan
Assignees:	-
Labels:	`area-System.Console`
Milestone:	Future

adamsitnik · 2023-11-14T07:19:24Z

Closing as a duplicate of #31466.

dotnet-issue-labeler bot added the area-System.Text.Encoding label Jun 2, 2022

ghost added the untriaged label Jun 2, 2022

jeffhandley added this to the Future milestone Aug 2, 2022

ghost removed the untriaged label Aug 2, 2022

jeffhandley added area-System.Console and removed area-System.Text.Encoding labels Aug 2, 2022

adamsitnik closed this as not planned Nov 14, 2023

github-actions bot locked and limited conversation to collaborators Dec 14, 2023
