.NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences #13547

GrabYourPitchforks · 2019-07-24T22:52:12Z


For updated documentation of this change, see .NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences.

When the UTF8Encoding class encounters an ill-formed UTF-8 byte sequence during a bytes-to-chars transcoding operation, it replaces that sequence with '�' (U+FFFD REPLACEMENT CHARACTER) in the output string. .NET Core 3.0 differs from previous versions of .NET Core and from the .NET Framework in that it follows the Unicode best practice for performing this replacement.

Version introduced

3.0

Change description

When transcoding bytes to chars, the UTF8Encoding class now performs character substitution based on Unicode best practices. The substitution mechanism is described in The Unicode Standard, Version 12.0, Sec. 3.9 (PDF), under the heading U+FFFD Substitution of Maximal Subparts.

This behavior applies only when the input byte sequence contains ill-formed UTF-8 data. Additionally, if the UTF8Encoding instance was constructed with throwOnInvalidBytes: true (see the constructor documentation), it continues to throw on invalid input rather than perform U+FFFD replacement.

Old behavior

Input: The 3-byte input: [ ED A0 90 ] (ill-formed input)
Output: The 2-char output: [ FFFD FFFD ]

New behavior

Input: The 3-byte input: [ ED A0 90 ] (ill-formed input)
Output: The 3-char output: [ FFFD FFFD FFFD ]

(This 3-char output is the preferred output per Table 3-9 of the previously linked Unicode Standard PDF.)
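
The rule behind the new output can be illustrated with a small, language-neutral sketch. The Python below is not the .NET implementation; it is a minimal decoder written only to show how "U+FFFD substitution of maximal subparts" yields three replacement characters for [ ED A0 90 ]. The function names `decode_maximal_subparts` and `_trail_ranges` are hypothetical, and the byte ranges are taken from the well-formed UTF-8 byte sequence table in Sec. 3.9 of the Unicode Standard.

```python
REPLACEMENT = "\ufffd"
CONT = (0x80, 0xBF)  # ordinary continuation-byte range


def _trail_ranges(lead):
    """Allowed ranges for the bytes that may follow `lead`, or None if
    `lead` can never start a well-formed multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]   # excludes non-shortest forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]   # excludes UTF-16 surrogates
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]  # caps at U+10FFFF
    return None


def decode_maximal_subparts(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal ill-formed subpart."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # ASCII passes through unchanged
            out.append(chr(lead))
            i += 1
            continue
        ranges = _trail_ranges(lead)
        if ranges is None:            # stray continuation or invalid lead
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        for lo, hi in ranges:         # consume the longest valid prefix
            if j < len(data) and lo <= data[j] <= hi:
                j += 1
            else:
                break
        if j - i == len(ranges) + 1:  # complete, well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                         # the consumed prefix is one subpart
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


result = decode_maximal_subparts(bytes([0xED, 0xA0, 0x90]))
print([f"{ord(c):04X}" for c in result])  # ['FFFD', 'FFFD', 'FFFD']
```

In this sketch, ED is rejected on its own because A0 falls outside the 80..9F range permitted after ED, and A0 and 90 are then each treated as stray continuation bytes, which gives the three U+FFFD characters shown under "New behavior".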

Reason for change

This is part of a larger effort to improve UTF-8 handling throughout .NET, including by the new System.Text.Unicode.Utf8 and System.Text.Rune types. The UTF8Encoding type was given improved error handling mechanics so that it produces output consistent with the newly introduced types.

Recommended action

No action is required on the part of the developer.

Affected APIs

Issue metadata

Issue type: breaking-change


rpetrusha · 2019-08-06T00:47:37Z

I've made some minor changes to the text of the issue, @GrabYourPitchforks. Please review them.

rpetrusha · 2019-09-11T20:39:30Z

@GrabYourPitchforks, do we need to note the preview version in which this change was introduced?

dotnet-bot added the breaking-change and ⌚ Not Triaged labels Jul 24, 2019

rpetrusha removed the ⌚ Not Triaged Not triaged label Aug 6, 2019

rpetrusha closed this as completed Sep 11, 2019

svick mentioned this issue Jan 11, 2020

UTF8 Encoding isn't consistent with .Net Framework dotnet/standard#1679

Closed

GrabYourPitchforks mentioned this issue Apr 15, 2021

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM dotnet/runtime#51353

Open
